So the Summer of Code ends soon and it’s time to report what has been done, and what is still lacking. Obviously, everybody can see there is no xmlsed yet in the repository, so what have I been doing since I completed xmlgrep one month ago? What went wrong?
Well, if you ask me, nothing really went wrong, I have just deviated somewhat from my original goals and taken the time to do some things that weren’t part of the initial plan. And even though I am disappointed not to have finished, this has brought its share of good additions. Most of these modifications stem from discussions with David (Young) about features we’d like to see in my xmltools. Also in all these three months, I’ve spent about three weeks optimizing the matcher code.
What we have
Let’s talk about the good stuff first. Of course, the visible part of my work is the existence of xmlgrep, but most of the code actually lives in a common static library. It provides:
- definitions for patterns and trees;
- utilities: (tree) buffers, bit vectors, symbol look-up, state and transition caching;
- a string to pattern parser (that I often just call "pattern compiler");
- an XML parser (which can handle fragments and multi-roots documents, or work as an interface to Expat, for strict XML parsing);
- an XML pretty printer;
- the code for the matcher itself;
- and a higher-level API used to build the tools themselves.
The library has managed to remain reasonably small, with less than 5k
lines of code, counting comments and everything, and the core matching
algorithm fits in less than 1.5k lines or so (compare this to the
15k-line xpath.c from libxml2). It is stream-oriented and care has
been taken so it eats as little memory as possible while remaining
efficient.
The pattern matching code
The matcher is undeniably the most important and sensible piece of code in there. It is a blend of a kind-of-NFA and a backtracking machine, which processes one node at the time (NFA) but is able to do unlimited look-ahead (backtracking). Recent work has involved adding new advanced constructs to the matcher (besides forward and look-ahead mechanisms, which were already implemented) and improving its performances when look-ahead is in use.
Performances
On the performance side, the main addition is that of a state cache and pre-computed states created by the pattern compiler (instead of the matching code itself).
The state cache (implemented as a hash table) operates at two levels :
- propagation (from parent to child and sibling to sibling);
- and filtering (deciding whether based on local properties, states match or not).
One effect of using a cache is that bit vectors need not be pooled anymore and a lot of copying has disappeared in favor of pointer swapping. Also, relying on states pre-computed by the pattern compiler has shifted the focus from looping bit by bit through bit vectors to whole-vector operations, most of which can be auto-vectorized by a sufficiently smart compiler (ICC on Linux can vectorize a dozen of these in the whole xmlgrep code, but as I’ve pointed before, GCC auto-vectorizer seems to lag behind its non-use, at least on my code and setup).
With these optimizations merged in, xmlgrep now outperforms xmlstarlet both in terms of speed and memory consumption on my silly little benchmark (for those who have not been there, it consists in grepping for a word and its definition in a 50M XML-formatted English dictionary, and yes, it’s simple-minded), even when using suboptimal look-ahead subpatterns.
Indeed, knowledge of the XML format often enables us to write
optimized patterns by anchoring a more costly part with a basic
predicate. In our case, instead of searching for
p[hw/.="somestring"] (the suboptimal case), we could search for
body/p[hw/.="somestring"] since p can only occur inside
body. This is an improvement since the p[] part, being
a look-ahead subpattern, is sensibly slower than a forward pattern, so
reducing the number of tentative matches is essential.
Compared to benchmarks done at the time of the release of xmlgrep, handling of this suboptimal case is about three times faster now.
Features
On the feature side, which is probably more interesting for casual would-be users, I have added grouping, Kleene-like operators, and subpattern negation, as well as reworked some of the old syntax.
Subpattern negation is the simplest extension, algthough it’s fairly
powerful. It can be achieved because of backtracking so it’s only
available on (look-ahead) subpatterns, not groups (see
below). Negation lets you write something like {a}!{a/b} which
selects all elements a that do not have a b element. This can’t
be done with regular patterns because a node can have many children,
and thus, unlike with regexes, something like a/!b (would-be syntax)
would match some a node with a child which is not a b, but there
can be many other children.
For example:
$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist
will extract only keys from unnamed dictionaries.
The second addition is Kleene-like operators. Well, sorry about the
name, but it’s just what comes to mind when I use these. They are, as
the name implies, somewhat like their regex counter part, the Kleene
star, in that they match zero or more repetitions of the preceding
predicate. E.g. a*%b matches <a/><a/><a/>...<b/> but not, say,
<a/><c/><d/>...<b/>. The %% and // operators have now become
special cases of the Kleene operators. a%%b can be written a%**%b
and a//b is really a/**/b, where the first star stands for "any
node" and the second is part of the operator.
But having these repetition operators limited to repeating a single
node predicate would have been quite boring, and so groups were
born. Groups, written inside parentheses () (notice these have
replaced the old subpattern syntax which has been moved back to being
{} from its first days), well, group a pattern fragment, so that it
can be repeated. They also accept alternation so you can write
a%(b|c)*%d, which just seems natural.
Let’s take an example:
$ cat test.xml
<B/><a/><b/><a/><b/><a/><b/><a/><b/><a/><b/><E/>
$ xmlgrep 'B%(a|b)*%E' test.xml
<E/>
Groups are implemented without backtracking so using a group is not much more costly than using plain forward patterns, and is faster than using look-ahead subpatterns.
Yet, that is only true of normal groups. When you use a selection
operator (+, #, - or =) on a group, it becomes a capturing
group, a.k.a. look-ahead group. A look-ahead group is really
a look-ahead subpattern combined with a group and has a somewhat
subtle behavior. It behaves like a group, with the following
difference: any selection operator inside a look-ahead group will only
apply if the whole group matches. This can be used to bind selectors
together. Thus we can revisit the classical key/value pair example:
how do we match a key/pair with some properties on the key and some
other on the value? You can always do that with simple look-ahead
subpatterns combined with forward patterns, but groups just make it
more convenient.
Imagine you want to match all key/value pairs where the key contains
the letter a and the value is a string. Easy, with a capturing
group:
$ xmlgrep -x '(key[.~"a"]#%string#)+' test.plist
Other components, reusability
Some people have discussed with me the possibility of making the library reusable. As it is now, it is not yet ready for wider exposure, but a lot of work has gone into rationalizing its API, in order to prepare for that. The API needs to be stabilized and the various components need to be better segregated.
Currently, the data structures in use are tightly integrated. For instance, the parser actually builds a tree with nodes able to hold matching states. Somehow, I would need to break that dependency, so that the matcher depends on the parser, but not the opposite, so you could use the XML parser as a stand-alone component (it has a nice hybrid design). Other parts are less affected, though.
What’s left to do
Now for the bitter part. Obviously, there are still a lot of things left to do, and I intend to continue working on this project, but whether it will be integrated into NetBSD at some point in the future is not for me to decide, though I really hope it will.
In short, here is the plan:
- write support for xmlsed templates in the library;
- write xmlsed proper;
- change the/add a high-level API to the library to make it possible to write xmljoin (because the high level framework only accepts one input file at this point in time);
- see what more needs to be adjusted for xmljoin to fit into the framework;
- write xmljoin proper;
- see what needs to be done for xmlsort to fit into the framework;
- write xmlsort.
Of course, I have given some thoughts to the "see what needs to be done" phases, but now is not right time to talk about that.
Also, at some point, I’ll need to address several shortcomings of the current code base:
- add locale support;
- convert my bunch of "regression shell scripts" to proper test suites (ATF, probably);
- add non-NetBSD build support (currently, I have to hand-compile it on my Linux box).
And I will also want to:
- clean-up the API and release the core components as a stand-alone library;
- write documentation on the internals to accompany that library;
- and of course, integrate it into NetBSD, if I am given permission. :)
Lastly, about xmlsed, though you can see no xmlsed directory in the
repository, I can say it is pretty close to completion since the
pieces needed are almost all there in the library (the only one
missing is the template support, and that’s not the hard part).
Conclusion
The Google Summer of Code is nearing its end, but my summer vacations are not over yet. I won’t have classes until September, so I plan to carry on as if nothing much happened at least till the end of August. After that, I won’t have nearly as much free time, but I’ll do what I can; let’s not be hasty and wait till then to see how far the project has gone.
I will continue to update this blog and the code will be available at my Git repository, as usual. (Anyway, I have just recently added comments to my blog so feel free to drop by and give me your opinion.)