After spending some days fixing bugs in the xmlgrep matcher code, I took some time again to look at the performance issues. The previous measurements showed xmlgrep was slower than xmlstarlet, even on patterns without subpatterns, which are theoretically optimal for my implementation. That really bothered me.
Profiling seemed to show xmlgrep spent most of its time in bit vector
operations – that much was expected. But my vain attempt (Git branch
bitvopt) at replacing tests (the FOREACHVSTATE macro) with bit
vector operations proved actually slower than the master version.
Then I had a talk with David Young, my mentor, and we concluded that both the input mechanism and the memory allocation pattern were suboptimal.
About the input problem, I’ve replaced the old input feeder, which used fgets(3), with direct read(2) calls, and this has indeed shaved some seconds off the counter. This is especially noticeable for patterns without subpatterns, since I/O accounts for more in that case.
The memory allocation issue was more serious. Basically, a lot of bit
vectors and subcontext frames were needlessly allocated then almost
immediately freed afterwards. This is due to the following construct
commonly found in patterns (maybe I should optimize this one case
specifically): patterns which start with an anchored subpattern,
typically something like a[b] or {a%b}. This kind of patterns
causes the matcher to try to match every node in the tree against
the subpattern yet fail almost all the time (on all those nodes that
do not match a). Such a failed subpattern match entails a call to
pushsub followed by a call to popsub, which are responsible for
the management of the subcontext stack.
I’ve switched to pool allocators for both internal tree nodes and bit vectors. This has drastically reduced the number of allocations made during a run. David also pointed out that I could use a pre-allocated vector instead of a linked list for the subcontext stack, and I remembered one of the nice properties of my implementation: the number of subcontext is bounded by the maximal nesting depth of subpatterns. He also suggested I should use copy-on-write bit vectors, but that proved unnecessary as I now use pointer swapping instead of copying then erasing (which was quite stupid in the first place, I agree).
All these changes were developed in the memopt2 branch overnight and
are now in master. The same measurements as before show some
exciting improvements, most notably the fact that we are now faster
than libxml on patterns that do not use look-ahead! Basically, on
this example, we are about 50% faster than before.