More benchmarks: the bitvopt branch
Those who have been keeping an eye on my Git repository have surely
noticed the recent update of the bitvopt branch, which contains
alternate implementations of the core routines of the matching engine
(match.c): followsibling and followparent. The bitvopt code
replaces the direct tests and propagation entirely with bit vector
operations (masking, shifting, etc.).
The main advantage of this approach is vectorization. Such
operations are easily done on a single unit (the type of which is
defined in bitv.h and somewhat configurable with the USE_STDINT
macro) and can be generalized to loops that carry no dependencies
between iterations. These loops can then be auto-vectorized by
a sufficiently smart compiler. The current code in bitvopt is
understood and vectorized by both GCC 4 with -ftree-vectorize and
ICC 10.1. Beware, ICC 10.0 fails because of the restrict qualifiers,
which are necessary; also note that, on x86, you have to pass
USE_STDINT in order to select a type bigger than unsigned char, or
else, SIMD shift operations won’t be available. To see the
vectorization reports, you can use -ftree-vectorizer-verbose=5 for
GCC and -vec-report for ICC.
The results are not quite satisfactory and the branch was not merged
into master:
On the chart, GCC has vectorization disabled by default while ICC
always has vectorization (even though I have not specified -vect
anywhere). The measurements have been done on my old Pentium 4, with
-O2 and appropriate architecture flags, with GCC 4.1.2 and ICC
10.1. The values are averages computed over 100 runs of the following
pattern on the same dictionary as before:
$ xmlgrep 'p[hw/?="Ab\"a*cus"]#'
I must say I can’t explain why operations on native integers (the
-int builds) are slower than those on bytes. As for the relative
gain of non-vectorized code over vectorized one, this is most probably
because the pattern is not long enough (less than one base unit) to
take advantage of SIMD so we lose in the overhead introduced with
vectorization (all the more so as data may be unaligned on entry of
these functions). Lastly, it seems auto-vectorization in GCC really is
not advantageous right now, since the non-vectorized versions tend to
outperform their vectorized counterparts when compiled with GCC (and
this is not the case with ICC).
Anyway, this was just a side experiment of mine; work on the main tools continue: plans for more features in the matcher (on request, case insensitivity, arithmetic predicates, maybe some other things) as well as the introduction of xmlsed are underway.
![[Atom]](feed.png)