Ours & Hippy — le blogOurs & Hippyourshippy@huoc.orgtag:blog.huoc.org,2009:atom2009-06-25T12:48:34+02:00tag:blog.huoc.org,2009:posts/10
xmlsed preview: writing XML
2009-06-25T12:48:34+02:002009-06-25T12:48:34+02:00Nhat Minh Lê (rz0)
<p>One might be tempted to think that, while parsing can be a complex
task involving various competing models, printing is straightforward
and should be the easy part of the job. This is not quite true
however, as David and I have been in disagreement for some time
already over the exact semantics and syntax of a potential XML
printing subsystem of xmltools.
</p><p>This post is all about the design of the printing language in xmltools
and the underlying model. I also present at the end my latest hybrid
proposal, that I’ve submitted to David for approval earlier today, and
which I hope will bring satisfaction to all parties. If you have some
opinion on this topic, do not hesitate to e-mail me! (My e-mail is my
first name followed by my last name with all spaces replaced by dots;
it’s written in French in the footer so you might not be able to
decode it.)
</p><p>The basic problem can be described as follows: we have a source tree,
the input document and we want to transform it by adding or removing
nodes in some places. Simple? Not quite.
</p><h3>Select & transform
</h3><p>The first design I proposed is based on <b>sed(1)</b>: we select some
node in the tree using a tree pattern (same as in my xmlgrep) and then
transform it using some actions: insert, append, replace, etc. Only
these needed to be adapted to XML.
</p><p>My idea was then to use PYX/ESIS-inspired actions, which also happen
to model closely my internal API. These consisted of a one-character
prefix indicating the thing to print, followed by contents (depending
on the prefix). Prefixes were:
</p><table><tr><td><code>(</code> </td><td>opening tag
</td></tr><tr><td><code>)</code> </td><td>closing tag
</td></tr><tr><td><code>@</code> </td><td>attribute
</td></tr><tr><td><code>.</code> </td><td>text
</td></tr><tr><td><code>!</code> </td><td>comment
</td></tr><tr><td><code>?</code> </td><td>junk
</td></tr></table><p>This is a very simple approach but David objected that it was
unnatural to write XML using some language which was not XML.
</p><dl><dt><b>Advantages</b></dt><dd><ul><li>Very simple and straightforward.
</li><li>Completely symmetrical to the way patterns are matched.
</li></ul></dd><dt><b>Disadvantages</b></dt><dd><ul><li>Code used to write XML does not look like the output.
</li><li>Not intuitive to read as data.
</li></ul></dd></dl><h3>Decorate & pull
</h3><p>Another approach, which is used most notably by XSLT, is to have
a template, which itself is written as XML, in which some elements
from the source document are <em>pulled</em>.
</p><p>Let me say it upfront in case anybody is not clear on this: pulling is
<em>not</em> a stream technique, so this model as is cannot be implemented
easily (some subsets of XSLT as stream transformations have been
researched, but since it is from the start not a stream-oriented
technique, it’s complex and IMHO not worth it).
</p><p>David first suggested using a template as a way of having
printing-heavy script actually look like their output. He observed
plain XML could be written directly while some special elements,
e.g. <code>xmlsed:addattr</code> could be used to print things that cannot be
represented directly in XML. Indeed, there is no XML syntax to write
just an attribute without the surrounding tag, yet there are cases
where we might want to add an attribute to an existing element.
</p><p>I criticized this idea based on the fact that reserving a namespace
was intrusive and that using XML to represent scripting elements was
too verbose.
</p><p>But the main point was that pull-style transformation was never an
option.
</p><dl><dt><b>Advantages</b></dt><dd><ul><li>XML in script remains in output.
</li><li>Very intuitive and easy to read.
</li></ul></dd><dt><b>Disadvantages</b></dt><dd><ul><li>Model is not adapted to streaming.
</li><li>Too verbose.
</li></ul></dd></dl><h3>Towards a hybrid model: XML syntax for select & transform
</h3><p>Given the stream-unfriendly nature of the template-based model, David
came up with the idea to represent actions of the select & transform
model as XML: XML opening and closing tags could be used to represent
the corresponding printing actions (respectively <code>(</code> and <code>)</code>) but for
the rest, we were still stuck with the way-too-verbose XSLT-like
<code><xmlsed:whatever></code> elements.
</p><p>I was still very dissatisfied with the verbosity of having to write
<code><xmlsed:whatever/></code> just to add an attribute or specify where to
insert the current node (before, after, inside the plain XML part),
but the concept itself, of using a template, was quite nice.
</p><dl><dt><b>Advantages</b></dt><dd><ul><li>XML in script remains in output.
</li><li>Very intuitive and easy to read.
</li><li>Still reasonably simple to implement.
</li></ul></dd><dt><b>Disadvantages</b></dt><dd><ul><li>Still too verbose.
</li></ul></dd></dl><h3>A hybrid model
</h3><p>Today, after more brainstorming, I finally decided on the following
model, and unless David is really against it (which I seriously
doubt), I think it will make it into xmlsed one way or the other: the
idea is the same as that of the previous attempt, use a template which
actually describes actions inferred from the structure of XML itself.
</p><p>Only, this time, we do not use special elements to indicate actions
anymore, <em>everything</em> is inferred from the structure. The trick to do
this is to have a virtual element represent the current node; it can
have any name (it will be user-settable), let’s call it <code>:</code> by
default. We can place it wherever we want in the template. And to add
an attribute (resp. a child), we just have to add an attribute
(resp. a child) to that virtual element. The syntax remains reasonably
compact (even <em>optimally compact</em> given we’re using XML as the basis
for our template) and the script now very closely models the output,
even more than XSLT scripts!
</p><p>The idea was inspired from <b>xargs(1)</b> and its <code>-I</code> option. Again,
Unix saved the day… well, not quite, but still, I think it is a nice
and practical compromise. :)
</p>
tag:blog.huoc.org,2009:posts/9
More benchmarks: the bitvopt branch
2009-06-21T11:28:10+02:002009-06-21T11:28:10+02:00Nhat Minh Lê (rz0)
<p>Those who have been keeping an eye on my Git repository have surely
noticed the recent update of the <code>bitvopt</code> branch, which contains
alternate implementations of the core routines of the matching engine
(<code>match.c</code>): <code>followsibling</code> and <code>followparent</code>. The <code>bitvopt</code> code
replaces the direct tests and propagation entirely with bit vector
operations (masking, shifting, etc.).
</p><p>The main advantage of this approach is <em>vectorization.</em> Such
operations are easily done on a single unit (the type of which is
defined in <code>bitv.h</code> and somewhat configurable with the <code>USE_STDINT</code>
macro) and can be generalized to loops that carry no dependencies
between iterations. These loops can then be auto-vectorized by
a sufficiently smart compiler. The current code in <code>bitvopt</code> is
understood and vectorized by both GCC 4 with <code>-ftree-vectorize</code> and
ICC 10.1. Beware, ICC 10.0 fails because of the <code>restrict</code> qualifiers,
which are necessary; also note that, on x86, you <em>have to</em> pass
<code>USE_STDINT</code> in order to select a type bigger than <code>unsigned char</code>, or
else, SIMD shift operations won’t be available. To see the
vectorization reports, you can use <code>-ftree-vectorizer-verbose=5</code> for
GCC and <code>-vec-report</code> for ICC.
</p><p>The results are not quite satisfactory and the branch was not merged
into <code>master</code>:
</p><div class="Affiche"><img src="http://blog.huoc.org/media/9-1-vect.png" alt="GCC auto-vectorization not doing well…" />
</div><p>On the chart, GCC has vectorization disabled by default while ICC
always has vectorization (even though I have not specified <code>-vect</code>
anywhere). The measurements have been done on my old Pentium 4, with
<code>-O2</code> and appropriate architecture flags, with GCC 4.1.2 and ICC
10.1. The values are averages computed over 100 runs of the following
pattern on the same dictionary as before:
</p><pre><code>$ xmlgrep 'p[hw/?="Ab\"a*cus"]#'
</code></pre><p>I must say I can’t explain why operations on native integers (the
<code>-int</code> builds) are slower than those on bytes. As for the relative
gain of non-vectorized code over vectorized one, this is most probably
because the pattern is not long enough (less than one base unit) to
take advantage of SIMD so we lose in the overhead introduced with
vectorization (all the more so as data may be unaligned on entry of
these functions). Lastly, it seems auto-vectorization in GCC really is
not advantageous right now, since the non-vectorized versions tend to
outperform their vectorized counterparts when compiled with GCC (and
this is not the case with ICC).
</p><p>Anyway, this was just a side experiment of mine; work on the main
tools continue: plans for more features in the matcher (on request,
case insensitivity, arithmetic predicates, maybe some other things) as
well as the introduction of xmlsed are underway.
</p>
tag:blog.huoc.org,2009:posts/8
xmlgrep by examples: playing with Atom
2009-06-17T14:23:11+02:002009-06-22T14:30:22+02:00Nhat Minh Lê (rz0)
<p><i>Edited 22.06.</i> <strong>Incompatible syntax changes</strong> and more examples.
</p><p>Since Matthew Sporleder <a class="extern" href="http://mail-index.netbsd.org/tech-userlevel/2009/06/16/msg002258.html">on tech-userlevel</a> has (implicitly)
suggested that xmlgrep could be used to dissect Atom feeds, but was
a bit lost at how to do it exactly, I thought I’d post a little demo
of a console session playing with an Atom file and xmlgrep (as well as
some other command-line tools).
</p><p>First, let’s fetch the feed:
</p><pre><code>$ ftp http://mspo.com/blog/atom.xml
[...]
</code></pre>As a side effect of xmlgrep, we might want to indent the XML file to
make it human-readable:
<pre><code>$ xmlgrep '*' atom.xml | more
</code></pre>List all posts in the NetBSD category with their IDs:
<pre><code>$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]
</code></pre><p>Those of you yet unfamiliar with the syntax might have some trouble
understanding. The previous pattern could be read "select a text child
of an id or title element, itself a child of an entry element, which
contains a category element which has an attribute child term equal to
NetBSD." Step by step, you should notice that <code>a[b]</code> is read "a such
that b", <code>|</code> stands for "or", <code>/</code> for "child of", <code>@</code> for "attribute",
<code>.</code> for "text", and the braces are used for grouping purposes.
</p><p>Now, let’s select a post by ID:
</p><pre><code>$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml
</code></pre>Or, select a post by title and view its contents using w3m:
<pre><code>$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html
</code></pre><p>As a side note, I should mention that up to now we have used
subpatterns quite a lot. This is because the Atom feed specification
does not force an order (or does it?) on the children of <code>entry</code>
elements. With more precise knowledge of the order of elements
relative to each other, we could have optimized the pattern to use <code>%</code>
and <code>%%</code> where possible. Subpatterns <em>are</em> costly, but for data sets
this size, we probably don’t care much.
</p><p>Let’s print all entry titles which date from March 2009 using the fact
that we <em>know</em> the <code>updated</code> element comes before the <code>title</code> one:
</p><pre><code>$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml
</code></pre>A friend of mine told me it would be useful to have arithmetic
predicates. I think they will feature in xmltools sooner or later, but
even without them, it is still possible to do some simple statistics,
by combining the results with <b>awk(1)</b>, for example. The following
one-liner counts the number of posts that have no older than March:
<pre><code>$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l
</code></pre><p>That’s it; I hope this will help people who want to get started with
xmlgrep. If you have other good examples you’d want me to elaborate,
do not hesitate to send me a mail!
</p>
tag:blog.huoc.org,2009:posts/7
Improving performances of xmlgrep
2009-06-16T12:28:06+02:002009-06-16T12:28:06+02:00Nhat Minh Lê (rz0)
<p>After spending some days fixing bugs in the xmlgrep matcher code,
I took some time again to look at the performance <i>issues.</i> The
previous measurements showed xmlgrep was slower than xmlstarlet, even
on patterns without subpatterns, which are theoretically optimal for
my implementation. That really bothered me.
</p><p>Profiling seemed to show xmlgrep spent most of its time in bit vector
operations – that much was expected. But my vain attempt (Git branch
<code>bitvopt</code>) at replacing tests (the <code>FOREACHVSTATE</code> macro) with bit
vector operations proved actually slower than the master version.
</p><p>Then I had a talk with David Young, my mentor, and we concluded that
both the input mechanism and the memory allocation pattern were
suboptimal.
</p><p>About the input problem, I’ve replaced the old input feeder, which
used <b>fgets(3)</b>, with direct <b>read(2)</b> calls, and this has indeed
shaved some seconds off the counter. This is especially noticeable for
patterns without subpatterns, since I/O accounts for more in that
case.
</p><p>The memory allocation issue was more serious. Basically, a lot of bit
vectors and subcontext frames were needlessly allocated then almost
immediately freed afterwards. This is due to the following construct
commonly found in patterns (maybe I should optimize this one case
specifically): patterns which start with an anchored subpattern,
typically something like <code>a[b]</code> or <code>{a%b}</code>. This kind of patterns
causes the matcher to try to match <em>every node</em> in the tree against
the subpattern yet fail almost all the time (on all those nodes that
do not match <code>a</code>). Such a failed subpattern match entails a call to
<code>pushsub</code> followed by a call to <code>popsub</code>, which are responsible for
the management of the subcontext stack.
</p><p>I’ve switched to pool allocators for both internal tree nodes and bit
vectors. This has drastically reduced the number of allocations made
during a run. David also pointed out that I could use a pre-allocated
vector instead of a linked list for the subcontext stack, and
I remembered one of the nice properties of my implementation: the
number of subcontext is bounded by the maximal nesting depth of
subpatterns. He also suggested I should use copy-on-write bit vectors,
but that proved unnecessary as I now use pointer swapping instead of
copying then erasing (which was quite stupid in the first place,
I agree).
</p><p>All these changes were developed in the <code>memopt2</code> branch overnight and
are now in <code>master</code>. The same measurements as before show some
exciting improvements, most notably the fact that we are now faster
than libxml on patterns that do not use look-ahead! Basically, <em>on
this example,</em> we are about 50% faster than before.
</p><div class="Affiche"><img src="http://blog.huoc.org/media/7-1-faster.png" alt="Revisited speed comparison" />
</div>
tag:blog.huoc.org,2009:posts/6
xmlgrep toy benchmarks
2009-06-12T01:06:18+02:002009-06-12T01:06:18+02:00Nhat Minh Lê (rz0)
As xmlgrep is approaching a pre-pre-alpha release (something that is
still very experimental, but nonetheless working, somewhat :), I’ve
been doing some basic tests on it, including one simple benchmark. The
results are, IMHO, encouraging. The benchmark pits xmlgrep against
equivalent existing software:
<ul><li>the selection tool from the <a class="extern" href="http://xmlstar.sourceforge.net/">xmlstarlet project</a> (available in
pkgsrc as <code>textproc/xmlstarlet</code>);
</li><li>another xmlgrep from the Perl <a class="extern" href="http://xmltwig.com/">Twig</a> package (<code>textproc/p5-XML-Twig</code>
in pkgsrc);
</li><li>and GNU grep, though it does not really accomplish the same, from
the base NetBSD-5 distribution, as a reference.
</li></ul><p>Since each tool accepts its own patterns and treats them differently,
I’ve tried my best to get meaningful results with each, but obviously,
the behaviors obtained differ somewhat.
</p><p>The <code>tests/memplot</code> script was used to make the sampling. The data
were plotted using <a class="extern" href="http://www.gnuplot.info/">Gnuplot</a>. The following commands were run:
</p><pre><code>$ xmlgrep 'hw/?="Ab\"a*cus"' d.xml
$ xmlgrep 'hw[?="Ab\"a*cus"]' d.xml
$ xml sel -t -c "//hw[text()='Ab&quot;a*cus']" d.xml
$ xml_grep "hw[string()='Ab\"a*cus']" d.xml
$ grep 'Ab"a\*cus' d.xml
</code></pre><p>The goal was to retrieve the <code><hw></code> element which contains the
<code>Ab"a*cus</code> string from a 53M dictionary, retrieved from
<a class="extern" href="http://www.ibiblio.org/webster/">http://www.ibiblio.org/webster/</a> and merged into a single file.
</p><p>The first two lines correspond to two different invocations of
xmlgrep, which do not do exactly the same thing, and with the first
being the <em>good one,</em> the other inefficiently abusing the look-ahead
mechanism. Yet, it was included for comparison purposes and maybe to
be fair to the XPath-based alternatives, which have to rely on
a predicate.
</p><p>Results follow; the second and third pictures are just zooms which
omit one or another of the candidates.
</p><div class="Avertissement"><p>I <em>know</em> these results don’t mean much, so don’t take them too
seriously. I’ve only tested a single simple pattern. Actually,
I didn’t really intend to make a comparison, at first, I just wanted
to make sure xmlgrep doesn’t leak or otherwise misuse memory. But
since the sampling code was there, I thought it’d be fun to do some
additional measurements.
</p></div><div class="Affiche"><img src="http://blog.huoc.org/media/6-1-bench.png" alt="Overall benchmark" />
</div><div class="Affiche"><img src="http://blog.huoc.org/media/6-2-lomem.png" alt="Low-memory candidates benchmark" />
</div><div class="Affiche"><img src="http://blog.huoc.org/media/6-3-fast.png" alt="Fast candidates benchmark" />
</div><p>As expected, xmlstarlet, which uses a full DOM model requires a lot of
memory to do its work (more than 400M!). But it is the fastest.
</p><p>My xmlgrep, when used right, is still nearly 42% slower than
xmlstarlet (from above 4s to below 6s), but I think the times remain
reasonable; memory-wise, it is also the lightest, with a constant 2.9M
in use throughout the whole run. Even GNU grep requires about 2.8M.
</p><p>Surprisingly, the Twig xmlgrep is not too big a memory killer, though
its memory usage increases over time, by steps (though it looks linear
on the long run). This appears somewhat strange to me; why some nodes
should be pruned from its internal DOM-like tree while others seem to
remain for the duration of the program. However, on the speed side,
Twig is predictably very slow; it takes nearly two minutes to produce
the results comparable to those of its C counterparts.
</p><p>That’s it for tonight; as one friend of mine told me, more interesting
benchmarks will probably come from actual users.
</p>
tag:blog.huoc.org,2009:posts/5
XML as a tree representation
2009-06-10T15:33:14+02:002009-06-10T15:33:14+02:00Nhat Minh Lê (rz0)
<p>xmltools is a project whose ultimate goal is to bring Unix
command-line efficiency to the XML world (and not, as many might
think, to bring XML into the Unix world, which is being done already,
though not by my fault). The idea is to treat XML as a generic tree
representation format.
</p><p>This seems very straightforward; in fact, one could argue that XML was
designed from the start to be a generic tree representation. However,
XML has a number of problems that make it hard to use in this role.
</p><p>The first such problems is doctype declarations. While the bare XML
structure, which we see most of the time, is very simple, doctype
declarations and their semantics are both unneeded and unwanted in
tree <i>(syntax)</i> manipulation programs. It would be fine if doctype
support was optional for non-validating XML processors, but as
I understand it, the XML standard mandates support for a portion of
it: entity declarations and references, and default attribute values.
</p><p>The second issue is with white space handling: XML does not include
any default behavior regarding white space. XML parsers must pass all
white space to the application, which is responsible for <em>some
sensible default processing.</em> The <code>xml:space</code> attribute can give
a hint to the application but as far as I know, it’s not in wide use.
</p><p>This all led me to develop a set of <i>compatibility rules</i> which
precisely describe what one should expect from xmltools, no matter
which backend is used (in case we move away from expat, which I plan
to do, eventually). This subset is sufficient to describe any
tree-like data set.
</p><p>The rules will be maintained in the <code>doc/compatxml.text</code> file, in the
xmltools source tree. For convenience, I’ve reproduced the current
version below:
</p><ol><li><p>Do not use doctype declarations for any purpose other than
validation. In particular, do <em>not</em> rely on doctype declarations to
provide default attributes and do <em>not</em> use internal entities for
arbitrary substitutions; if you wanted to save typing, you wouldn’t
be using XML. Use of numeric and predefined entities is OK. Use of
directly-encoded characters is preferred.
</p></li><li><p>Do not presuppose the XML processor has access to any resource
beside your file: do not use external entities.
</p></li><li><p>Do not assume the XML processor is able to handle multiple
character sets: all the documents and other data supplied to the
program should use the same character set. Also, set your locale
appropriately.
</p></li><li><p>Unless <code>xml:space</code> is set, assume all white-space only text
elements are ignored by the XML processor.
</p></li><li><p>Set <code>xml:space</code> to <code>preserve</code> whenever white space not kept by the
previous rule must be preserved. As stated in the first rule, do not
rely on doctype declarations to implicitly set this
attribute. Please note however that the fourth rule works well for
most purposes, including processing of HTML <code>pre</code> elements which
only contain verbatim text.
</p></li></ol>