Ours & Hippy — le blogOurs & Hippyourshippy@huoc.orgtag:blog.huoc.org,2009:atom2009-06-21T11:28:10+02:00tag:blog.huoc.org,2009:posts/9
More benchmarks: the bitvopt branch
2009-06-21T11:28:10+02:002009-06-21T11:28:10+02:00Nhat Minh Lê (rz0)
<p>Those who have been keeping an eye on my Git repository have surely
noticed the recent update of the <code>bitvopt</code> branch, which contains
alternate implementations of the core routines of the matching engine
(<code>match.c</code>): <code>followsibling</code> and <code>followparent</code>. The <code>bitvopt</code> code
replaces the direct tests and propagation entirely with bit vector
operations (masking, shifting, etc.).
</p><p>The main advantage of this approach is <em>vectorization.</em> Such
operations are easily done on a single unit (the type of which is
defined in <code>bitv.h</code> and somewhat configurable with the <code>USE_STDINT</code>
macro) and can be generalized to loops that carry no dependencies
between iterations. These loops can then be auto-vectorized by
a sufficiently smart compiler. The current code in <code>bitvopt</code> is
understood and vectorized by both GCC 4 with <code>-ftree-vectorize</code> and
ICC 10.1. Beware, ICC 10.0 fails because of the <code>restrict</code> qualifiers,
which are necessary; also note that, on x86, you <em>have to</em> pass
<code>USE_STDINT</code> in order to select a type bigger than <code>unsigned char</code>, or
else, SIMD shift operations won’t be available. To see the
vectorization reports, you can use <code>-ftree-vectorizer-verbose=5</code> for
GCC and <code>-vec-report</code> for ICC.
</p><p>The results are not quite satisfactory and the branch was not merged
into <code>master</code>:
</p><div class="Affiche"><img src="http://blog.huoc.org/media/9-1-vect.png" alt="GCC auto-vectorization not doing well…" />
</div><p>On the chart, GCC has vectorization disabled by default while ICC
always has vectorization (even though I have not specified <code>-vect</code>
anywhere). The measurements have been done on my old Pentium 4, with
<code>-O2</code> and appropriate architecture flags, with GCC 4.1.2 and ICC
10.1. The values are averages computed over 100 runs of the following
pattern on the same dictionary as before:
</p><pre><code>$ xmlgrep 'p[hw/?="Ab\"a*cus"]#'
</code></pre><p>I must say I can’t explain why operations on native integers (the
<code>-int</code> builds) are slower than those on bytes. As for the relative
gain of non-vectorized code over vectorized one, this is most probably
because the pattern is not long enough (less than one base unit) to
take advantage of SIMD so we lose in the overhead introduced with
vectorization (all the more so as data may be unaligned on entry of
these functions). Lastly, it seems auto-vectorization in GCC really is
not advantageous right now, since the non-vectorized versions tend to
outperform their vectorized counterparts when compiled with GCC (and
this is not the case with ICC).
</p><p>Anyway, this was just a side experiment of mine; work on the main
tools continue: plans for more features in the matcher (on request,
case insensitivity, arithmetic predicates, maybe some other things) as
well as the introduction of xmlsed are underway.
</p>
tag:blog.huoc.org,2009:posts/7
Improving performances of xmlgrep
2009-06-16T12:28:06+02:002009-06-16T12:28:06+02:00Nhat Minh Lê (rz0)
<p>After spending some days fixing bugs in the xmlgrep matcher code,
I took some time again to look at the performance <i>issues.</i> The
previous measurements showed xmlgrep was slower than xmlstarlet, even
on patterns without subpatterns, which are theoretically optimal for
my implementation. That really bothered me.
</p><p>Profiling seemed to show xmlgrep spent most of its time in bit vector
operations – that much was expected. But my vain attempt (Git branch
<code>bitvopt</code>) at replacing tests (the <code>FOREACHVSTATE</code> macro) with bit
vector operations proved actually slower than the master version.
</p><p>Then I had a talk with David Young, my mentor, and we concluded that
both the input mechanism and the memory allocation pattern were
suboptimal.
</p><p>About the input problem, I’ve replaced the old input feeder, which
used <b>fgets(3)</b>, with direct <b>read(2)</b> calls, and this has indeed
shaved some seconds off the counter. This is especially noticeable for
patterns without subpatterns, since I/O accounts for more in that
case.
</p><p>The memory allocation issue was more serious. Basically, a lot of bit
vectors and subcontext frames were needlessly allocated then almost
immediately freed afterwards. This is due to the following construct
commonly found in patterns (maybe I should optimize this one case
specifically): patterns which start with an anchored subpattern,
typically something like <code>a[b]</code> or <code>{a%b}</code>. This kind of patterns
causes the matcher to try to match <em>every node</em> in the tree against
the subpattern yet fail almost all the time (on all those nodes that
do not match <code>a</code>). Such a failed subpattern match entails a call to
<code>pushsub</code> followed by a call to <code>popsub</code>, which are responsible for
the management of the subcontext stack.
</p><p>I’ve switched to pool allocators for both internal tree nodes and bit
vectors. This has drastically reduced the number of allocations made
during a run. David also pointed out that I could use a pre-allocated
vector instead of a linked list for the subcontext stack, and
I remembered one of the nice properties of my implementation: the
number of subcontext is bounded by the maximal nesting depth of
subpatterns. He also suggested I should use copy-on-write bit vectors,
but that proved unnecessary as I now use pointer swapping instead of
copying then erasing (which was quite stupid in the first place,
I agree).
</p><p>All these changes were developed in the <code>memopt2</code> branch overnight and
are now in <code>master</code>. The same measurements as before show some
exciting improvements, most notably the fact that we are now faster
than libxml on patterns that do not use look-ahead! Basically, <em>on
this example,</em> we are about 50% faster than before.
</p><div class="Affiche"><img src="http://blog.huoc.org/media/7-1-faster.png" alt="Revisited speed comparison" />
</div>