Ours & Hippy — le blog (ourshippy@huoc.org)
xmlgrep toy benchmarks
Nhat Minh Lê (rz0), 2009-06-12T01:06:18+02:00
<p>As xmlgrep is approaching a pre-pre-alpha release (something still very experimental, but nonetheless working, somewhat :), I’ve been doing some basic tests on it, including one simple benchmark. The results are, IMHO, encouraging. The benchmark pits xmlgrep against equivalent existing software: </p><ul><li>the selection tool from the <a class="extern" href="http://xmlstar.sourceforge.net/">xmlstarlet project</a> (available in pkgsrc as <code>textproc/xmlstarlet</code>); </li><li>another xmlgrep from the Perl <a class="extern" href="http://xmltwig.com/">Twig</a> package (<code>textproc/p5-XML-Twig</code> in pkgsrc); </li><li>and, as a reference, GNU grep from the base NetBSD-5 distribution, though it does not really accomplish the same thing. </li></ul><p>Since each tool accepts its own patterns and treats them differently, I’ve tried my best to get meaningful results with each, but obviously, the behaviors obtained differ somewhat. </p><p>The <code>tests/memplot</code> script was used to do the sampling. The data were plotted using <a class="extern" href="http://www.gnuplot.info/">Gnuplot</a>. The following commands were run: </p><pre><code>$ xmlgrep 'hw/?=&quot;Ab\&quot;a*cus&quot;' d.xml
$ xmlgrep 'hw[?=&quot;Ab\&quot;a*cus&quot;]' d.xml
$ xml sel -t -c &quot;//hw[text()='Ab&amp;quot;a*cus']&quot; d.xml
$ xml_grep &quot;hw[string()='Ab\&quot;a*cus']&quot; d.xml
$ grep 'Ab&quot;a\*cus' d.xml
</code></pre><p>The goal was to retrieve the <code>&lt;hw&gt;</code> element which contains the <code>Ab&quot;a*cus</code> string from a 53M dictionary, retrieved from <a class="extern" href="http://www.ibiblio.org/webster/">http://www.ibiblio.org/webster/</a> and merged into a single file. 
</p><p>The first two lines are two different invocations of xmlgrep that do not do exactly the same thing: the first is the <em>good one</em>, while the second inefficiently abuses the look-ahead mechanism. The second was nonetheless included for comparison purposes, and maybe to be fair to the XPath-based alternatives, which have to rely on a predicate. </p><p>Results follow; the second and third pictures are just zooms that omit one or another of the candidates. </p><div class="Avertissement"><p>I <em>know</em> these results don’t mean much, so don’t take them too seriously. I’ve only tested a single simple pattern. Actually, I didn’t really intend to make a comparison at first; I just wanted to make sure xmlgrep doesn’t leak or otherwise misuse memory. But since the sampling code was there, I thought it’d be fun to do some additional measurements. </p></div><div class="Affiche"><img src="http://blog.huoc.org/media/6-1-bench.png" alt="Overall benchmark" /> </div><div class="Affiche"><img src="http://blog.huoc.org/media/6-2-lomem.png" alt="Low-memory candidates benchmark" /> </div><div class="Affiche"><img src="http://blog.huoc.org/media/6-3-fast.png" alt="Fast candidates benchmark" /> </div><p>As expected, xmlstarlet, which uses a full DOM model, requires a lot of memory to do its work (more than 400M!). But it is the fastest. </p><p>My xmlgrep, when used right, is still nearly 42% slower than xmlstarlet (from above 4s to below 6s), but I think the times remain reasonable; memory-wise, it is also the lightest, with a constant 2.9M in use throughout the whole run. Even GNU grep requires about 2.8M. </p><p>Surprisingly, the Twig xmlgrep is not too big a memory killer, though its memory usage increases over time, in steps (it looks linear in the long run). This seems somewhat strange to me: why should some nodes be pruned from its internal DOM-like tree while others seem to remain for the duration of the program?
However, on the speed side, Twig is predictably very slow; it takes nearly two minutes to produce results comparable to those of its C counterparts. </p><p>That’s it for tonight; as a friend of mine told me, more interesting benchmarks will probably come from actual users. </p>
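<p>As a footnote to the memory figures above, the gap between a full-DOM tool and a streaming one can be illustrated with a small sketch. This is <em>not</em> how xmlgrep, xmlstarlet or Twig are implemented; it is a minimal Python illustration that assumes a document made of many small entries under a single root. The helper name <code>find_matching</code> and the sample document are hypothetical. It walks the file with <code>xml.etree.ElementTree.iterparse</code> and detaches already-visited entries from the root, so memory use should stay roughly flat instead of growing with the file. </p>

```python
import io
import xml.etree.ElementTree as ET


def find_matching(source, tag, wanted):
    """Yield serialized <tag> elements whose full text equals `wanted`.

    Hypothetical helper: streams the document instead of keeping it whole.
    """
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    for event, elem in events:
        if event != "end" or elem.tag != tag:
            continue
        # Compare the concatenated text, roughly like XPath string().
        if "".join(elem.itertext()) == wanted:
            elem.tail = None  # do not serialize text that follows the element
            yield ET.tostring(elem, encoding="unicode")
        # Detach everything parsed so far from the root so it can be freed.
        root.clear()


if __name__ == "__main__":
    # Tiny stand-in for the dictionary file (an assumption, not the real data).
    sample = b'<dict><p><hw>Abacus</hw></p><p><hw>Ab"a*cus</hw></p></dict>'
    for hit in find_matching(io.BytesIO(sample), "hw", 'Ab"a*cus'):
        print(hit)
```

<p>A DOM-style variant would call <code>ET.parse</code> first and search afterwards, which keeps every node alive until the end of the run; that is the kind of behavior one would expect to show up as a large, growing memory curve on a 53M input. </p>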