[Atom] [Mail] [Twitter]
Links: git · hacks · misc · cabal · buzz! · about +
Today’s menu
\/

Not dead yet: xmltools, long after the summer

#. By rz0. Posted on 04/06 2010 at 07:29 PM. 2 comments.
netbsd parsing xmlgrep xmlsed xmltools

Soon, it will be the Summer (of Code) again, and this year, too, NetBSD will be participating.α At around the same time, last year, I was busy writing technical proposals for what would become xmltools, one of the accepted NetBSD SoC projects of 2009. Well, most of you have probably forgotten about that small project; some may even think it was a useless attempt and a waste of time and money, which could have been better spent elsewhere… Err, in fact, I’m writing this to say that my xmltools are not dead! And I still intend to contribute the code to the NetBSD tree sometime in the (near) future.

α : And if you’re a student who’s into systems and stuff, looking for a cool organization to join, with a broad selection of projects, why not give NetBSD a try? :)

That being said, if you take a look at the Git repo, you’ll surely notice that nothing has moved for months… If you fool around a bit more, though, you may notice another project on the same Git host: regxml. Yep, this is a rewrite of my xmltools!

Why a rewrite, you ask? Well, basically because the old implementation was flawed by design, in addition to having become a mess.

The little boring story

And here’s the long answer, for those who care. Feel free to skip this section if it bores you.

During the second part of last year’s SoC, as I was trying to work on xmlsed, I realized my design was flawed. The main problem was that I had based all my efforts around a concept that was, for our purpose, completely inappropriate: node sets.

So what’s a node set? A node set is simply that: a set of nodes. Now, that’s great if you are extracting data from an XML document, as did xmlgrep, but it sucks if you’re going to operate further on that. In particular, removing arbitrary nodes from an XML tree doesn’t yield a tree, and replacing a node set made no sense at all. (How would you replace two far away nodes with one big chunk of XML? There are many ways to go about it… but they’re all wrong IMO.)

So what emerged at the end of the summer was that… I needed a new design, and as my original algorithm had become crippled with dubious ad hoc attempts to support operations that it never could, I needed a new algorithm too. (Besides, there was a deficiency in my algorithm that made it perform in super-polynomial time with some patterns and data sets, which was bad.)

It took me something like one month (since I wasn’t full-time on it) to jot down the new concepts and a (hopefully) good algorithm on paper. Then, for months, I’d try to implement it, but each time, something would feel really wrong, and I’d trash my new implementation. Well, that was till some days ago, when somehow, I got it right, and so the new xmltools were born!

XML intervals

So all this boils down to the following fundamental change: XML node sets have been replaced by XML intervals, as the core objects the xmltools operate on. An XML interval is really just that: a bunch of adjacent siblings and all their children. You should read the included xmltools(7) manual page for more information, from a user’s point of view.

From a more theoretical viewpoint, intervals have two important advantages over sets:

  1. Inserting or removing an XML interval in an XML tree keeps it a tree. In other words, XML trees are stable by insertion and deletion of arbitrary XML intervals. (It may not have any meaning anymore, but remember xmltools are about syntax, not semantics.)

  2. XML intervals map (injectively) to byte intervals in the XML file, hence all editing operations on XML intervals (sequences of insertions and deletions) can be expressed as operations on sequences of bytes in the file itself.

The first property is what I am convinced will make xmlsed possible. Previously, with node sets, it was a nightmare to give proper semantics to the various operations. But now, we have simple objects that still retain a great deal of expressivity but can be manipulated easily, with simple rules.

Although I do not use the second property yet, I’m definitely thinking about introducing an API for that sometime in the (not quite near) future.

The idea is that you run the XML matcher on a file mapped in memory (mapping new chunks as needed), and get a program script consisting of byte operations, that you can then run very efficiently on the memory-mapped file! Instead of printing an interval, you’d just copy over the entire memory region. Instead of deleting an interval, you’d just ignore that chunk. You get the idea. It could be used for efficient processing of massive data with complex transformation rules, but we’re not quite there, yet. :)

xmltools and NetBSD

As I’ve discussed with David (<dyoung>), I intend to import my core library, as well as xmlgrep into base sometime in the near future (once it has undergone more thorough testing). It depends on Expat for XML parsing, so I’ll need to import that too.

After much thinking, I think I’ll settle with a subdirectory in src/external/bsd, though I use the NetBSD build system, as I want to keep my development model using Git, and the ability to easily make packages of all my tools for non-NetBSD platforms.

The new code base uses (Net)BSD idioms more extensively than the old one (because I’ve got the time to read more NetBSD code since I started in May of last year). (E.g. I’ve followed David’s suggestion that I should make use of sys/queue.h.) Quite surprisingly, it compiles almost as is on GNU/Linux, with three additional defines and two trivial functions. I do provide a simple GNU makefile in addition to the default NetBSD build system makefiles, for GNU/Linux builds. Of course, you’re welcome to try to build it on other systems I don’t have access to!

What the future holds

Well, what’ll come next depends on a lot of factors, some of which are completely outside of my control. But there is one thing you can do: please download, compile, read the man pages, and test xmlgrep and give me some feedback, so I can fix bugs!

$ git clone git://git.huoc.org/regxml.git

Or through Git-Web: http://git.huoc.org/?p=regxml.git;a=summary

As far as the xmltools project is concerned, I’m now going to concentrate on xmlsed. I already have an idea of the implementation, and I think it’s not going to be too hard… hopefully, I’m not wrong. :)

P.S.: Oh, I forgot to talk about the new algorithm. Well, to get the idea, it’s basically an automaton backed by a backtracking memoizing work-queue-based interpreter, which roams the XML tree and try to match intervals. You should read the xmltools(7) man page for a high-level overview of the technique. Maybe I’ll write something more consistent on the subject later…

[ tag:blog.huoc.org,2009:posts/48 ]
View comments · Comment

/\ \/

xmltools implementation: automata and backtracking

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 07/20 2009 at 12:39 AM. No comments.
<. xmltools update: new I/O layer and further plans
>. xmltools: what's done, what's left
backtracking nfa parsing xmltools

At the moment, I’m implementing submatches in patterns, in preparation for xmlsed. Since most of what I do draws from formal languages and automata (except I’m too lazy to write anything formal), I’ve been looking into TNFAs for inspiration. Seeing how TRE seems to implement look-ahead without backtracking, I began to ask myself if such a thing could be done for my pattern matcher. Unfortunately, I can’t seem to figure out a solution by myself, so I’m blogging instead. :)

Basically, the problem is with constructs such as a[b]/c which match c, child of a only if a has a child b. However, when we reach a, we don’t know yet whether it has a child b or not, so we’re in need of some look-ahead.

This is akin to a(?=b)c in Perl regexes, but as you can already see, it’s fairly different. Perhaps the most obvious difference is that a tree has "two dimensions" while a string has only one, so it’s possible for a to have two children b and c, in a tree, while it’s not possible for a to be followed by two different characters, in a string.

Backtracking

The easiest way to solve this problem is to use backtracking. This is how the matcher is currently implemented. Even though about everything else is based solely on states and transitions, the part implementing look-ahead predicates uses backtracking.

The algorithm looks like this: everytime we have to match a look-ahead predicate, we stop, read as many nodes as we need to decide and then either drop the state or keep it, depending on the outcome.

This is a simple approach and it works well (i.e. it gives the expected results). Only, it’s slow. As a comparison, using the patterns I’ve used before for my benchmarks, the difference between hw/.="Ab\"a*cus" and p[hw/.="Ab\"a*cus"]# is the second one is about 2.66 times slower. This can be hand-optimized by rewriting the second pattern as body/p[hw/.="Ab\"a*cus"]#, with knowledge of the input format, where p can only appear as a child of body. This cuts the slowness to a mere 1.62 times slower.

At the same time, it demonstrates a major source of inefficiency in my backtracking implementation: loose (non-anchored) subpatterns, which may potentially apply to a great number of nodes yet would fail immediately for most of them, result in a lot of unwanted short backtracking segments where the matcher tries the subpattern and then immediately backtracks after failing, thus still examining the node twice.

Product automata

In theory, a pattern such as a(?=b)c matches if a is followed by both b and c so another natural idea would be to use a product automaton which transitions between couples of states: if we end up in two final states, then we win.

Only, this works well because we want to know whether the string matches as a whole. However, with an XML pattern such as a[b]/c, we need to identify which node matched the c part. This is where things start to get complicated.

Now we do know how to extract a submatch from a string with tagged transitions. So, great, does this apply to our XML matcher? The answer is "no", as far as I know (I’d be glad if someone proved me wrong, though). Tagged transitions are good at extracting one submatch, the last one, but in our case, the problem is we want all occurrences of c, not just the last one.

Keeping this information globally associated with the state is not an option since there may well be multiple instances of the look-ahead predicate trying to match at the same time. But trying to differentiate between two states based on the matching position is not an option either, because that could easily lead to linear memory usage. Consider, for example, the following XML fragment:

<a/><a/><a/><a/><a/>...<b/>

Trying to match {a%%b} (or (a%%b) using the current syntax in the master branch; sorry, I reverted to the old syntax because () will be used for groups) against that would result in a match at every a element. Hence, keeping the position as a state property is not a solution, since we’d have to distinguish between as many a nodes as there are in the input data.

Optimizing the backtracking method

Although the implementation of the matcher has undergone many changes in order to improve performances recently, mostly so as to accommodate new features while keeping a reasonable running time, I believe there’s still room for much improvement, maybe in the form of algorithmic enhancements rather than implementation optimizations.

Some of my more realistic ideas are:

  • Read one element ahead to reject some subpatterns without entering a backtracking context.

  • Try to predict the outcome of a look-ahead predicate and compute the more likely transitions along with the predicate itself, saving on backtracking (except to prune the tree) in case we’re right (but incuring additional recomputations if we’re wrong).

Also, this is not directly related to the backtracking approach, but I’m thinking of putting more information in the state cache; at the moment, we’re only caching "raw" transitions which account for the old state set and the direction (i.e. successor or child); we could try to cache local information as well, which would move us one step closer to the way NFA transition functions get cached to get DFAs.

Conclusion: do we need performance that badly?

Well, that’s a good question. And the answer is "we probably don’t". This is kind of my hobby, so don’t get the wrong idea. The whole thing is not that slow. On my simple example, xmlgrep was only 0.5s slower than libxml-powered xmlstarlet, taking less than six seconds to look up a word definition in a 58M dictionary, on my 2x2GHz Core 2, while using 100 times less memory.

I doubt anybody is going to use xmlgrep to look up words in dictionaries anytime soon; such a specialized task would best be served using an indexed collection of some sort, although I should mention that there are XML formats which are more suited to being parsed using my tools than others, so converting to a more appropriate (e.g. annotated) XML format using xmlsed then searching with xmlgrep could well be a viable solution too (given the conversion is one-time or infrequent compared to searches). More on that once xmlsed is out!

[ tag:blog.huoc.org,2009:posts/16 ]
No comments · Comment

/\

In search of a good pull-style XML parser

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/09 2009 at 11:07 AM. No comments.
>. XML as a tree representation
api parsing xml xmltools

Recently, I’ve been working with expat extensively, as part of my Google Summer of Code project, and have come to the realization that there is, IMHO, no good well-maintained lightweight stream-oriented XML parser.

Of course, one would think expat, at this point, and I thought so too, when I started, only to find myself now locked with an expat-like design, which is both inefficient and unnecessarily complicated (for what I want to do).

So, what is this expat design I am talking about? It is the event-driven (callback-oriented) model, what the XML guys out there call push parsers, though I personally think it doesn’t make much sense to call it that way, so I won’t.

While event-driven parsing is good for some tasks — awk(1) is a really clever tool — it really fails on complex parsing tasks such as those required by my xmltools that rely on partial reprocessing of the document tree.

Reprocessing of the document tree basically means we have to keep an in-memory copy of the parts we want to traverse more than once. Of course, expat can (help us) do that. Using expat, the basic idea is to have two input sources which both call the event handlers. Then, whenever we need to reprocess some part of the tree, we just traverse our in-memory copy and feed the information to the handlers as if they were receiving these from expat.

Now, this seems good at first, but the fact is we now have two parsers: expat and our tree parser. Moreover, they don’t share an identical behavior: on the expat side, we may need to build new nodes to put in the partial tree (for future reprocessing), while in the in-memory tree parser, we may need to free some nodes which are no longer needed. But even in the expat handling code there may be nodes we have to keep (usually the current branch, from the root to the current node); this in turn implies we sometimes have to free some nodes as well. Now we have bits and pieces of the in-memory tree management code everywhere. Things start to look messy…

Besides, when do we call the surrogate parser? The issue here is we don’t want expat to feed new data while we are in the middle of (or more precisely before we can even start) a reprocessing operation. The expat manual points at the XML_StopParser and XML_ResumeParser functions. These seem to be exactly what we need, right? Wrong. Indeed, a more careful reading of the documentation reveals parsing does not stop immediately after the handler that called XML_StopParser; there may be an unspecified number of additional events between the end of that one and the moment the parser actually stops. This makes these routines pretty much useless. Actually, we need to use the secondary input system right on top of expat itself, in the middle of our callbacks, whenever we need this functionality. This is not pretty to say the least…

But there is a more elegant solution: the idea is to have the internal tree management module drive the process. It maintains the in-memory tree and calls the event code on its nodes whenever necessary, possibly going through loops or any patterns that suit its needs, for that matter. Then, when it requires a node that is not present in the tree, it calls upon the input module which returns the next node in the stream. This is the so-called pull method. There really is nothing special about it; in fact, it’s the way most I/O libraries out there are built. Why XML had to be different, I don’t know.

As I’ve said, to the extent of my knowledge, there is no lightweight and well-maintained (I mean not some two-year-old experiment by some obscure hacker, though I respect these things, I don’t have time right now to play with such codes) C library which implements the full XML standard and this pull metaphor. If anyone out there knows, do not hesitate to let me know.

[ tag:blog.huoc.org,2009:posts/3 ]
No comments · Comment