So the Summer of Code ends soon and it’s time to report what has been
done, and what is still lacking. Obviously, everybody can see there is
no xmlsed yet in the repository, so what have I been doing since
I completed xmlgrep one month ago? What went wrong?
Well, if you ask me, nothing really went wrong, I have just deviated
somewhat from my original goals and taken the time to do some things
that weren’t part of the initial plan. And even though I am
disappointed not to have finished, this has brought its share of good
additions. Most of these modifications stem from discussions with
David (Young) about features we’d like to see in my xmltools. Also in
all these three months, I’ve spent about three weeks optimizing the
matcher code.
What we have
Let’s talk about the good stuff first. Of course, the visible part of
my work is the existence of xmlgrep, but most of the code actually
lives in a common static library. It provides:
- definitions for patterns and trees;
- utilities: (tree) buffers, bit vectors, symbol look-up, state and
transition caching;
- a string to pattern parser (that I often just call "pattern compiler");
- an XML parser (which can handle fragments and multi-roots documents,
or work as an interface to Expat, for strict XML parsing);
- an XML pretty printer;
- the code for the matcher itself;
- and a higher-level API used to build the tools themselves.
The library has managed to remain reasonably small, with less than 5k
lines of code, counting comments and everything, and the core matching
algorithm fits in less than 1.5k lines or so (compare this to the
15k-line xpath.c from libxml2). It is stream-oriented and care has
been taken so it eats as little memory as possible while remaining
efficient.
The pattern matching code
The matcher is undeniably the most important and sensible piece of
code in there. It is a blend of a kind-of-NFA and a backtracking
machine, which processes one node at the time (NFA) but
is able to do unlimited look-ahead (backtracking). Recent work has
involved adding new advanced constructs to the matcher (besides
forward and look-ahead mechanisms, which were already implemented) and
improving its performances when look-ahead is in use.
Performances
On the performance side, the main addition is that of a state cache
and pre-computed states created by the pattern compiler (instead of
the matching code itself).
The state cache (implemented as a hash table) operates at two levels :
- propagation (from parent to child and sibling to sibling);
- and filtering (deciding whether based on local properties, states
match or not).
One effect of using a cache is that bit vectors need not be pooled
anymore and a lot of copying has disappeared in favor of pointer
swapping. Also, relying on states pre-computed by the pattern compiler
has shifted the focus from looping bit by bit through bit vectors to
whole-vector operations, most of which can be auto-vectorized by
a sufficiently smart compiler (ICC on Linux can vectorize a dozen of
these in the whole xmlgrep code, but as I’ve pointed before,
GCC auto-vectorizer seems to lag behind its non-use, at least on my
code and setup).
With these optimizations merged in, xmlgrep now outperforms xmlstarlet
both in terms of speed and memory consumption on my silly little
benchmark (for those who have not been there, it consists in grepping
for a word and its definition in a 50M XML-formatted English
dictionary, and yes, it’s simple-minded), even when using suboptimal
look-ahead subpatterns.
Indeed, knowledge of the XML format often enables us to write
optimized patterns by anchoring a more costly part with a basic
predicate. In our case, instead of searching for
p[hw/.="somestring"] (the suboptimal case), we could search for
body/p[hw/.="somestring"] since p can only occur inside
body. This is an improvement since the p[] part, being
a look-ahead subpattern, is sensibly slower than a forward pattern, so
reducing the number of tentative matches is essential.
Compared to benchmarks done at the time of the release of xmlgrep,
handling of this suboptimal case is about three times faster now.
Features
On the feature side, which is probably more interesting for casual
would-be users, I have added grouping, Kleene-like operators, and
subpattern negation, as well as reworked some of the old syntax.
Subpattern negation is the simplest extension, algthough it’s fairly
powerful. It can be achieved because of backtracking so it’s only
available on (look-ahead) subpatterns, not groups (see
below). Negation lets you write something like {a}!{a/b} which
selects all elements a that do not have a b element. This can’t
be done with regular patterns because a node can have many children,
and thus, unlike with regexes, something like a/!b (would-be syntax)
would match some a node with a child which is not a b, but there
can be many other children.
For example:
$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist
will extract only keys from unnamed dictionaries.
The second addition is Kleene-like operators. Well, sorry about the
name, but it’s just what comes to mind when I use these. They are, as
the name implies, somewhat like their regex counter part, the Kleene
star, in that they match zero or more repetitions of the preceding
predicate. E.g. a*%b matches <a/><a/><a/>...<b/> but not, say,
<a/><c/><d/>...<b/>. The %% and // operators have now become
special cases of the Kleene operators. a%%b can be written a%**%b
and a//b is really a/**/b, where the first star stands for "any
node" and the second is part of the operator.
But having these repetition operators limited to repeating a single
node predicate would have been quite boring, and so groups were
born. Groups, written inside parentheses () (notice these have
replaced the old subpattern syntax which has been moved back to being
{} from its first days), well, group a pattern fragment, so that it
can be repeated. They also accept alternation so you can write
a%(b|c)*%d, which just seems natural.
Let’s take an example:
$ cat test.xml
<B/><a/><b/><a/><b/><a/><b/><a/><b/><a/><b/><E/>
$ xmlgrep 'B%(a|b)*%E' test.xml
<E/>
Groups are implemented without backtracking so using a group is not
much more costly than using plain forward patterns, and is faster than
using look-ahead subpatterns.
Yet, that is only true of normal groups. When you use a selection
operator (+, #, - or =) on a group, it becomes a capturing
group, a.k.a. look-ahead group. A look-ahead group is really
a look-ahead subpattern combined with a group and has a somewhat
subtle behavior. It behaves like a group, with the following
difference: any selection operator inside a look-ahead group will only
apply if the whole group matches. This can be used to bind selectors
together. Thus we can revisit the classical key/value pair example:
how do we match a key/pair with some properties on the key and some
other on the value? You can always do that with simple look-ahead
subpatterns combined with forward patterns, but groups just make it
more convenient.
Imagine you want to match all key/value pairs where the key contains
the letter a and the value is a string. Easy, with a capturing
group:
$ xmlgrep -x '(key[.~"a"]#%string#)+' test.plist
Other components, reusability
Some people have discussed with me the possibility of making the
library reusable. As it is now, it is not yet ready for wider
exposure, but a lot of work has gone into rationalizing its API, in
order to prepare for that. The API needs to be stabilized and the
various components need to be better segregated.
Currently, the data structures in use are tightly integrated. For
instance, the parser actually builds a tree with nodes able to hold
matching states. Somehow, I would need to break that dependency, so
that the matcher depends on the parser, but not the opposite, so you
could use the XML parser as a stand-alone component (it has a nice
hybrid design). Other parts are less affected, though.
What’s left to do
Now for the bitter part. Obviously, there are still a lot of things
left to do, and I intend to continue working on this project, but
whether it will be integrated into NetBSD at some point in the future
is not for me to decide, though I really hope it will.
In short, here is the plan:
- write support for xmlsed templates in the library;
- write xmlsed proper;
- change the/add a high-level API to the library to make it possible
to write xmljoin (because the high level framework only accepts one
input file at this point in time);
- see what more needs to be adjusted for xmljoin to fit into the
framework;
- write xmljoin proper;
- see what needs to be done for xmlsort to fit into the framework;
- write xmlsort.
Of course, I have given some thoughts to the "see what needs to be
done" phases, but now is not right time to talk about that.
Also, at some point, I’ll need to address several shortcomings of the
current code base:
- add locale support;
- convert my bunch of "regression shell scripts" to proper test suites
(ATF, probably);
- add non-NetBSD build support (currently, I have to hand-compile it
on my Linux box).
And I will also want to:
- clean-up the API and release the core components as a stand-alone
library;
- write documentation on the internals to accompany that library;
- and of course, integrate it into NetBSD, if I am given
permission. :)
Lastly, about xmlsed, though you can see no xmlsed directory in the
repository, I can say it is pretty close to completion since the
pieces needed are almost all there in the library (the only one
missing is the template support, and that’s not the hard part).
Conclusion
The Google Summer of Code is nearing its end, but my summer vacations
are not over yet. I won’t have classes until September, so I plan to
carry on as if nothing much happened at least till the end of
August. After that, I won’t have nearly as much free time, but I’ll do
what I can; let’s not be hasty and wait till then to see how far the
project has gone.
I will continue to update this blog and the code will be available at
my Git repository, as usual. (Anyway, I have
just recently added comments to my blog so feel free to drop by and
give me your opinion.)