Not dead yet: xmltools, long after the summer

(Nhat Minh Lê (rz0) @ 2010-04-06 19:29:07)

Soon, it will be the Summer (of Code) again, and this year, too, NetBSD will be participating.α At around the same time, last year, I was busy writing technical proposals for what would become xmltools, one of the accepted NetBSD SoC projects of 2009. Well, most of you have probably forgotten about that small project; some may even think it was a useless attempt and a waste of time and money, which could have been better spent elsewhere… Err, in fact, I’m writing this to say that my xmltools are not dead! And I still intend to contribute the code to the NetBSD tree sometime in the (near) future.

α : And if you’re a student who’s into systems and stuff, looking for a cool organization to join, with a broad selection of projects, why not give NetBSD a try? :)

That being said, if you take a look at the Git repo, you’ll surely notice that nothing has moved for months… If you fool around a bit more, though, you may notice another project on the same Git host: regxml. Yep, this is a rewrite of my xmltools!

Why a rewrite, you ask? Well, basically because the old implementation was flawed by design, in addition to having become a mess.

The little boring story

And here’s the long answer, for those who care. Feel free to skip this section if it bores you.

During the second part of last year’s SoC, as I was trying to work on xmlsed, I realized my design was flawed. The main problem was that I had based all my efforts around a concept that was, for our purpose, completely inappropriate: node sets.

So what’s a node set? A node set is simply that: a set of nodes. Now, that’s great if you are extracting data from an XML document, as did xmlgrep, but it sucks if you’re going to operate further on that. In particular, removing arbitrary nodes from an XML tree doesn’t yield a tree, and replacing a node set made no sense at all. (How would you replace two far away nodes with one big chunk of XML? There are many ways to go about it… but they’re all wrong IMO.)

So what emerged at the end of the summer was that… I needed a new design, and as my original algorithm had become crippled with dubious ad hoc attempts to support operations that it never could, I needed a new algorithm too. (Besides, there was a deficiency in my algorithm that made it perform in super-polynomial time with some patterns and data sets, which was bad.)

It took me something like one month (since I wasn’t full-time on it) to jot down the new concepts and a (hopefully) good algorithm on paper. Then, for months, I’d try to implement it, but each time, something would feel really wrong, and I’d trash my new implementation. Well, that was till some days ago, when somehow, I got it right, and so the new xmltools were born!

XML intervals

So all this boils down to the following fundamental change: XML node sets have been replaced by XML intervals, as the core objects the xmltools operate on. An XML interval is really just that: a bunch of adjacent siblings and all their children. You should read the included xmltools(7) manual page for more information, from a user’s point of view.

From a more theoretical viewpoint, intervals have two important advantages over sets:

  1. Inserting or removing an XML interval in an XML tree keeps it a tree. In other words, XML trees are stable by insertion and deletion of arbitrary XML intervals. (It may not have any meaning anymore, but remember xmltools are about syntax, not semantics.)

  2. XML intervals map (injectively) to byte intervals in the XML file, hence all editing operations on XML intervals (sequences of insertions and deletions) can be expressed as operations on sequences of bytes in the file itself.

The first property is what I am convinced will make xmlsed possible. Previously, with node sets, it was a nightmare to give proper semantics to the various operations. But now, we have simple objects that still retain a great deal of expressivity but can be manipulated easily, with simple rules.

Although I do not use the second property yet, I’m definitely thinking about introducing an API for that sometime in the (not quite near) future.

The idea is that you run the XML matcher on a file mapped in memory (mapping new chunks as needed), and get a program script consisting of byte operations, that you can then run very efficiently on the memory-mapped file! Instead of printing an interval, you’d just copy over the entire memory region. Instead of deleting an interval, you’d just ignore that chunk. You get the idea. It could be used for efficient processing of massive data with complex transformation rules, but we’re not quite there, yet. :)

xmltools and NetBSD

As I’ve discussed with David (<dyoung>), I intend to import my core library, as well as xmlgrep into base sometime in the near future (once it has undergone more thorough testing). It depends on Expat for XML parsing, so I’ll need to import that too.

After much thinking, I think I’ll settle with a subdirectory in src/external/bsd, though I use the NetBSD build system, as I want to keep my development model using Git, and the ability to easily make packages of all my tools for non-NetBSD platforms.

The new code base uses (Net)BSD idioms more extensively than the old one (because I’ve got the time to read more NetBSD code since I started in May of last year). (E.g. I’ve followed David’s suggestion that I should make use of sys/queue.h.) Quite surprisingly, it compiles almost as is on GNU/Linux, with three additional defines and two trivial functions. I do provide a simple GNU makefile in addition to the default NetBSD build system makefiles, for GNU/Linux builds. Of course, you’re welcome to try to build it on other systems I don’t have access to!

What the future holds

Well, what’ll come next depends on a lot of factors, some of which are completely outside of my control. But there is one thing you can do: please download, compile, read the man pages, and test xmlgrep and give me some feedback, so I can fix bugs!

$ git clone git://git.huoc.org/regxml.git

Or through Git-Web: http://git.huoc.org/?p=regxml.git;a=summary

As far as the xmltools project is concerned, I’m now going to concentrate on xmlsed. I already have an idea of the implementation, and I think it’s not going to be too hard… hopefully, I’m not wrong. :)

P.S.: Oh, I forgot to talk about the new algorithm. Well, to get the idea, it’s basically an automaton backed by a backtracking memoizing work-queue-based interpreter, which roams the XML tree and try to match intervals. You should read the xmltools(7) man page for a high-level overview of the technique. Maybe I’ll write something more consistent on the subject later…