Ours & Hippy — le blog Ours & Hippy ourshippy@huoc.org tag:blog.huoc.org,2009:atom 2009-06-21T11:28:10+02:00 tag:blog.huoc.org,2009:posts/9 More benchmarks: the bitvopt branch 2009-06-21T11:28:10+02:00 2009-06-21T11:28:10+02:00 Nhat Minh Lê (rz0) <p>Those who have been keeping an eye on my Git repository have surely noticed the recent update of the <code>bitvopt</code> branch, which contains alternate implementations of the core routines of the matching engine (<code>match.c</code>): <code>followsibling</code> and <code>followparent</code>. The <code>bitvopt</code> code replaces the direct tests and propagation entirely with bit vector operations (masking, shifting,&#xA0;etc.). </p><p>The main advantage of this approach is <em>vectorization.</em> Such operations are easily done on a single unit (the type of which is defined in <code>bitv.h</code> and somewhat configurable with the <code>USE_STDINT</code> macro) and can be generalized to loops that carry no dependencies between iterations. These loops can then be auto-vectorized by a sufficiently smart compiler. The current code in <code>bitvopt</code> is understood and vectorized by both GCC&#xA0;4 with <code>-ftree-vectorize</code> and ICC&#xA0;10.1. Beware, ICC&#xA0;10.0 fails because of the <code>restrict</code> qualifiers, which are necessary; also note that, on x86, you <em>have to</em> pass <code>USE_STDINT</code> in order to select a type bigger than <code>unsigned char</code>, or else, SIMD shift operations won’t be available. To see the vectorization reports, you can use <code>-ftree-vectorizer-verbose=5</code> for GCC and <code>-vec-report</code> for ICC. </p><p>The results are not quite satisfactory and the branch was not merged into <code>master</code>: </p><div class="Affiche"><img src="http://blog.huoc.org/media/9-1-vect.png" alt="GCC auto-vectorization not doing well…" /> </div><p>On the chart, GCC has vectorization disabled by default while ICC always has vectorization (even though I have not specified <code>-vect</code> anywhere). The measurements have been done on my old Pentium&#xA0;4, with <code>-O2</code> and appropriate architecture flags, with GCC 4.1.2 and ICC 10.1. The values are averages computed over 100 runs of the following pattern on the same dictionary as before: </p><pre><code>$ xmlgrep 'p[hw/?=&quot;Ab\&quot;a*cus&quot;]#' </code></pre><p>I must say I can’t explain why operations on native integers (the <code>-int</code> builds) are slower than those on bytes. As for the relative gain of non-vectorized code over vectorized one, this is most probably because the pattern is not long enough (less than one base unit) to take advantage of SIMD so we lose in the overhead introduced with vectorization (all the more so as data may be unaligned on entry of these functions). Lastly, it seems auto-vectorization in GCC really is not advantageous right now, since the non-vectorized versions tend to outperform their vectorized counterparts when compiled with GCC (and this is not the case with ICC). </p><p>Anyway, this was just a side experiment of mine; work on the main tools continue: plans for more features in the matcher (on request, case insensitivity, arithmetic predicates, maybe some other things) as well as the introduction of xmlsed are underway. </p> tag:blog.huoc.org,2009:posts/7 Improving performances of xmlgrep 2009-06-16T12:28:06+02:00 2009-06-16T12:28:06+02:00 Nhat Minh Lê (rz0) <p>After spending some days fixing bugs in the xmlgrep matcher code, I took some time again to look at the performance <i>issues.</i> The previous measurements showed xmlgrep was slower than xmlstarlet, even on patterns without subpatterns, which are theoretically optimal for my implementation. That really bothered me. </p><p>Profiling seemed to show xmlgrep spent most of its time in bit vector operations – that much was expected. But my vain attempt (Git branch <code>bitvopt</code>) at replacing tests (the <code>FOREACHVSTATE</code> macro) with bit vector operations proved actually slower than the master version. </p><p>Then I had a talk with David Young, my mentor, and we concluded that both the input mechanism and the memory allocation pattern were suboptimal. </p><p>About the input problem, I’ve replaced the old input feeder, which used <b>fgets(3)</b>, with direct <b>read(2)</b> calls, and this has indeed shaved some seconds off the counter. This is especially noticeable for patterns without subpatterns, since I/O accounts for more in that case. </p><p>The memory allocation issue was more serious. Basically, a lot of bit vectors and subcontext frames were needlessly allocated then almost immediately freed afterwards. This is due to the following construct commonly found in patterns (maybe I should optimize this one case specifically): patterns which start with an anchored subpattern, typically something like <code>a[b]</code> or <code>{a%b}</code>. This kind of patterns causes the matcher to try to match <em>every node</em> in the tree against the subpattern yet fail almost all the time (on all those nodes that do not match <code>a</code>). Such a failed subpattern match entails a call to <code>pushsub</code> followed by a call to <code>popsub</code>, which are responsible for the management of the subcontext stack. </p><p>I’ve switched to pool allocators for both internal tree nodes and bit vectors. This has drastically reduced the number of allocations made during a run. David also pointed out that I could use a pre-allocated vector instead of a linked list for the subcontext stack, and I remembered one of the nice properties of my implementation: the number of subcontext is bounded by the maximal nesting depth of subpatterns. He also suggested I should use copy-on-write bit vectors, but that proved unnecessary as I now use pointer swapping instead of copying then erasing (which was quite stupid in the first place, I agree). </p><p>All these changes were developed in the <code>memopt2</code> branch overnight and are now in <code>master</code>. The same measurements as before show some exciting improvements, most notably the fact that we are now faster than libxml on patterns that do not use look-ahead! Basically, <em>on this example,</em> we are about 50% faster than before. </p><div class="Affiche"><img src="http://blog.huoc.org/media/7-1-faster.png" alt="Revisited speed comparison" /> </div> tag:blog.huoc.org,2009:posts/6 xmlgrep toy benchmarks 2009-06-12T01:06:18+02:00 2009-06-12T01:06:18+02:00 Nhat Minh Lê (rz0) As xmlgrep is approaching a pre-pre-alpha release (something that is still very experimental, but nonetheless working, somewhat :), I’ve been doing some basic tests on it, including one simple benchmark. The results are, IMHO, encouraging. The benchmark pits xmlgrep against equivalent existing software: <ul><li>the selection tool from the <a class="extern" href="http://xmlstar.sourceforge.net/">xmlstarlet project</a> (available in pkgsrc as <code>textproc/xmlstarlet</code>); </li><li>another xmlgrep from the Perl <a class="extern" href="http://xmltwig.com/">Twig</a> package (<code>textproc/p5-XML-Twig</code> in pkgsrc); </li><li>and GNU grep, though it does not really accomplish the same, from the base NetBSD-5 distribution, as a reference. </li></ul><p>Since each tool accepts its own patterns and treats them differently, I’ve tried my best to get meaningful results with each, but obviously, the behaviors obtained differ somewhat. </p><p>The <code>tests/memplot</code> script was used to make the sampling. The data were plotted using <a class="extern" href="http://www.gnuplot.info/">Gnuplot</a>. The following commands were run: </p><pre><code>$ xmlgrep 'hw/?=&quot;Ab\&quot;a*cus&quot;' d.xml $ xmlgrep 'hw[?=&quot;Ab\&quot;a*cus&quot;]' d.xml $ xml sel -t -c &quot;//hw[text()='Ab&amp;quot;a*cus']&quot; d.xml $ xml_grep &quot;hw[string()='Ab\&quot;a*cus']&quot; d.xml $ grep 'Ab&quot;a\*cus' d.xml </code></pre><p>The goal was to retrieve the <code>&lt;hw&gt;</code> element which contains the <code>Ab&quot;a*cus</code> string from a 53M dictionary, retrieved from <a class="extern" href="http://www.ibiblio.org/webster/">http://www.ibiblio.org/webster/</a> and merged into a single file. </p><p>The first two lines correspond to two different invocations of xmlgrep, which do not do exactly the same thing, and with the first being the <em>good one,</em> the other inefficiently abusing the look-ahead mechanism. Yet, it was included for comparison purposes and maybe to be fair to the XPath-based alternatives, which have to rely on a predicate. </p><p>Results follow; the second and third pictures are just zooms which omit one or another of the candidates. </p><div class="Avertissement"><p>I <em>know</em> these results don’t mean much, so don’t take them too seriously. I’ve only tested a single simple pattern. Actually, I didn’t really intend to make a comparison, at first, I just wanted to make sure xmlgrep doesn’t leak or otherwise misuse memory. But since the sampling code was there, I thought it’d be fun to do some additional measurements. </p></div><div class="Affiche"><img src="http://blog.huoc.org/media/6-1-bench.png" alt="Overall benchmark" /> </div><div class="Affiche"><img src="http://blog.huoc.org/media/6-2-lomem.png" alt="Low-memory candidates benchmark" /> </div><div class="Affiche"><img src="http://blog.huoc.org/media/6-3-fast.png" alt="Fast candidates benchmark" /> </div><p>As expected, xmlstarlet, which uses a full DOM model requires a lot of memory to do its work (more than 400M!). But it is the fastest. </p><p>My xmlgrep, when used right, is still nearly 42% slower than xmlstarlet (from above 4s to below 6s), but I think the times remain reasonable; memory-wise, it is also the lightest, with a constant 2.9M in use throughout the whole run. Even GNU grep requires about 2.8M. </p><p>Surprisingly, the Twig xmlgrep is not too big a memory killer, though its memory usage increases over time, by steps (though it looks linear on the long run). This appears somewhat strange to me; why some nodes should be pruned from its internal DOM-like tree while others seem to remain for the duration of the program. However, on the speed side, Twig is predictably very slow; it takes nearly two minutes to produce the results comparable to those of its C counterparts. </p><p>That’s it for tonight; as one friend of mine told me, more interesting benchmarks will probably come from actual users. </p>