Ours & Hippy — le blog (ourshippy@huoc.org)
xmlgrep toy benchmarks
2009-06-12T01:06:18+02:00, Nhat Minh Lê (rz0)
As xmlgrep is approaching a pre-pre-alpha release (something that is
still very experimental, but nonetheless working, somewhat :), I’ve
been doing some basic tests on it, including one simple benchmark. The
results are, IMHO, encouraging. The benchmark pits xmlgrep against
equivalent existing software:
<ul><li>the selection tool from the <a class="extern" href="http://xmlstar.sourceforge.net/">xmlstarlet project</a> (available in
pkgsrc as <code>textproc/xmlstarlet</code>);
</li><li>another xmlgrep from the Perl <a class="extern" href="http://xmltwig.com/">Twig</a> package (<code>textproc/p5-XML-Twig</code>
in pkgsrc);
</li><li>and GNU grep from the base NetBSD-5 distribution, as a reference,
though it does not really do the same job.
</li></ul><p>Since each tool accepts its own patterns and treats them differently,
I’ve tried my best to get meaningful results with each, but obviously,
the behaviors obtained differ somewhat.
</p><p>The <code>tests/memplot</code> script was used to sample memory usage over time, and the data
were plotted using <a class="extern" href="http://www.gnuplot.info/">Gnuplot</a>. The following commands were run:
</p><pre><code>$ xmlgrep 'hw/?="Ab\"a*cus"' d.xml
$ xmlgrep 'hw[?="Ab\"a*cus"]' d.xml
$ xml sel -t -c "//hw[text()='Ab&quot;a*cus']" d.xml
$ xml_grep "hw[string()='Ab\"a*cus']" d.xml
$ grep 'Ab"a\*cus' d.xml
</code></pre><p>The goal was to retrieve the <code>&lt;hw&gt;</code> element which contains the
<code>Ab"a*cus</code> string from a 53M dictionary, downloaded from
<a class="extern" href="http://www.ibiblio.org/webster/">http://www.ibiblio.org/webster/</a> and merged into a single file.
</p><p>The first two lines are two different invocations of xmlgrep,
which do not do exactly the same thing: the first is the
<em>good one,</em> while the second inefficiently abuses the look-ahead
mechanism. It was included anyway for comparison purposes, and maybe to
be fair to the XPath-based alternatives, which have to rely on
a predicate.
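</p><p>The <code>tests/memplot</code> script itself is not reproduced in this post. As a rough idea of what such sampling can look like, here is a minimal sketch, not the actual script: the <code>sample_rss</code> function name and its output format are made up for illustration, and it assumes a <code>ps</code> that accepts <code>-o rss=</code> and <code>-p</code> (as on NetBSD, Linux and most BSDs), with whole-second timestamps only.
</p>

```shell
#!/bin/sh
# Hypothetical stand-in for tests/memplot (the real script is not shown here).
# Usage: sample_rss INTERVAL COMMAND [ARG...]
# Prints one "elapsed_seconds rss_kilobytes" line per INTERVAL while COMMAND runs.
sample_rss() {
    interval=$1
    shift
    # Run the command under test in the background, discarding its own output
    # so that only the samples end up on stdout.
    "$@" >/dev/null 2>&1 &
    pid=$!
    start=$(date +%s)
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        # Assumption: ps reports the resident set size in kilobytes.
        rss=$(ps -o rss= -p "$pid" 2>/dev/null | tr -d ' ')
        # Skip empty or zero readings (process already gone or not yet reaped).
        if [ -n "$rss" ] && [ "$rss" -gt 0 ]; then
            echo "$(( $(date +%s) - start )) $rss"
        fi
        sleep "$interval"
    done
    # Reap the child and return its exit status.
    wait "$pid"
}
```

<p>Something like <code>sample_rss 1 xmlgrep 'hw/?="Ab\"a*cus"' d.xml &gt; xmlgrep.dat</code> would then produce a two-column file that Gnuplot can plot directly.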
</p><p>Results follow; the second and third pictures are just zooms which
omit one or another of the candidates.
</p><div class="Avertissement"><p>I <em>know</em> these results don’t mean much, so don’t take them too
seriously. I’ve only tested a single simple pattern. Actually,
I didn’t really intend to make a comparison at first; I just wanted
to make sure xmlgrep doesn’t leak or otherwise misuse memory. But
since the sampling code was there, I thought it’d be fun to do some
additional measurements.
</p></div><div class="Affiche"><img src="http://blog.huoc.org/media/6-1-bench.png" alt="Overall benchmark" />
</div><div class="Affiche"><img src="http://blog.huoc.org/media/6-2-lomem.png" alt="Low-memory candidates benchmark" />
</div><div class="Affiche"><img src="http://blog.huoc.org/media/6-3-fast.png" alt="Fast candidates benchmark" />
</div><p>As expected, xmlstarlet, which uses a full DOM model, requires a lot of
memory to do its work (more than 400M!), but it is also the fastest.
</p><p>My xmlgrep, when used right, is still nearly 42% slower than
xmlstarlet (from above 4s to below 6s), but I think the times remain
reasonable; memory-wise, it is also the lightest, with a constant 2.9M
in use throughout the whole run. Even GNU grep requires about 2.8M.
</p><p>Surprisingly, the Twig xmlgrep is not too big a memory killer, though
its memory usage increases over time, in steps (it looks linear
in the long run). This seems somewhat strange to me: why would some nodes
be pruned from its internal DOM-like tree while others seem to
remain for the duration of the program? On the speed side, however,
Twig is predictably very slow; it takes nearly two minutes to produce
results comparable to those of its C counterparts.
</p><p>That’s it for tonight; as one friend of mine told me, more interesting
benchmarks will probably come from actual users.
</p>