[Atom] [Mail] [Twitter]
Links: git · hacks · misc · cabal · buzz! · about +
Today’s menu
\/

Not dead yet: xmltools, long after the summer

#. By rz0. Posted on 04/06 2010 at 07:29 PM. One comment.
netbsd parsing xmlgrep xmlsed xmltools

Soon, it will be the Summer (of Code) again, and this year, too, NetBSD will be participating.α At around the same time, last year, I was busy writing technical proposals for what would become xmltools, one of the accepted NetBSD SoC projects of 2009. Well, most of you have probably forgotten about that small project; some may even think it was a useless attempt and a waste of time and money, which could have been better spent elsewhere… Err, in fact, I’m writing this to say that my xmltools are not dead! And I still intend to contribute the code to the NetBSD tree sometime in the (near) future.

α : And if you’re a student who’s into systems and stuff, looking for a cool organization to join, with a broad selection of projects, why not give NetBSD a try? :)

That being said, if you take a look at the Git repo, you’ll surely notice that nothing has moved for months… If you fool around a bit more, though, you may notice another project on the same Git host: regxml. Yep, this is a rewrite of my xmltools!

Why a rewrite, you ask? Well, basically because the old implementation was flawed by design, in addition to having become a mess.

The little boring story

And here’s the long answer, for those who care. Feel free to skip this section if it bores you.

During the second part of last year’s SoC, as I was trying to work on xmlsed, I realized my design was flawed. The main problem was that I had based all my efforts around a concept that was, for our purpose, completely inappropriate: node sets.

So what’s a node set? A node set is simply that: a set of nodes. Now, that’s great if you are extracting data from an XML document, as did xmlgrep, but it sucks if you’re going to operate further on that. In particular, removing arbitrary nodes from an XML tree doesn’t yield a tree, and replacing a node set made no sense at all. (How would you replace two far away nodes with one big chunk of XML? There are many ways to go about it… but they’re all wrong IMO.)

So what emerged at the end of the summer was that… I needed a new design, and as my original algorithm had become crippled with dubious ad hoc attempts to support operations that it never could, I needed a new algorithm too. (Besides, there was a deficiency in my algorithm that made it perform in super-polynomial time with some patterns and data sets, which was bad.)

It took me something like one month (since I wasn’t full-time on it) to jot down the new concepts and a (hopefully) good algorithm on paper. Then, for months, I’d try to implement it, but each time, something would feel really wrong, and I’d trash my new implementation. Well, that was till some days ago, when somehow, I got it right, and so the new xmltools were born!

XML intervals

So all this boils down to the following fundamental change: XML node sets have been replaced by XML intervals, as the core objects the xmltools operate on. An XML interval is really just that: a bunch of adjacent siblings and all their children. You should read the included xmltools(7) manual page for more information, from a user’s point of view.

From a more theoretical viewpoint, intervals have two important advantages over sets:

  1. Inserting or removing an XML interval in an XML tree keeps it a tree. In other words, XML trees are stable by insertion and deletion of arbitrary XML intervals. (It may not have any meaning anymore, but remember xmltools are about syntax, not semantics.)

  2. XML intervals map (injectively) to byte intervals in the XML file, hence all editing operations on XML intervals (sequences of insertions and deletions) can be expressed as operations on sequences of bytes in the file itself.

The first property is what I am convinced will make xmlsed possible. Previously, with node sets, it was a nightmare to give proper semantics to the various operations. But now, we have simple objects that still retain a great deal of expressivity but can be manipulated easily, with simple rules.

Although I do not use the second property yet, I’m definitely thinking about introducing an API for that sometime in the (not quite near) future.

The idea is that you run the XML matcher on a file mapped in memory (mapping new chunks as needed), and get a program script consisting of byte operations, that you can then run very efficiently on the memory-mapped file! Instead of printing an interval, you’d just copy over the entire memory region. Instead of deleting an interval, you’d just ignore that chunk. You get the idea. It could be used for efficient processing of massive data with complex transformation rules, but we’re not quite there, yet. :)

xmltools and NetBSD

As I’ve discussed with David (<dyoung>), I intend to import my core library, as well as xmlgrep into base sometime in the near future (once it has undergone more thorough testing). It depends on Expat for XML parsing, so I’ll need to import that too.

After much thinking, I think I’ll settle with a subdirectory in src/external/bsd, though I use the NetBSD build system, as I want to keep my development model using Git, and the ability to easily make packages of all my tools for non-NetBSD platforms.

The new code base uses (Net)BSD idioms more extensively than the old one (because I’ve got the time to read more NetBSD code since I started in May of last year). (E.g. I’ve followed David’s suggestion that I should make use of sys/queue.h.) Quite surprisingly, it compiles almost as is on GNU/Linux, with three additional defines and two trivial functions. I do provide a simple GNU makefile in addition to the default NetBSD build system makefiles, for GNU/Linux builds. Of course, you’re welcome to try to build it on other systems I don’t have access to!

What the future holds

Well, what’ll come next depends on a lot of factors, some of which are completely outside of my control. But there is one thing you can do: please download, compile, read the man pages, and test xmlgrep and give me some feedback, so I can fix bugs!

$ git clone git://git.huoc.org/regxml.git

Or through Git-Web: http://git.huoc.org/?p=regxml.git;a=summary

As far as the xmltools project is concerned, I’m now going to concentrate on xmlsed. I already have an idea of the implementation, and I think it’s not going to be too hard… hopefully, I’m not wrong. :)

P.S.: Oh, I forgot to talk about the new algorithm. Well, to get the idea, it’s basically an automaton backed by a backtracking memoizing work-queue-based interpreter, which roams the XML tree and try to match intervals. You should read the xmltools(7) man page for a high-level overview of the technique. Maybe I’ll write something more consistent on the subject later…

[ tag:blog.huoc.org,2009:posts/48 ]
View comments · Comment

/\ \/

xmlsed prototype

#. By rz0. Posted on 10/12 2009 at 10:30 PM. 5 comments.
api xmlsed xmltools

Following David Young’s post on the official NetBSD blog, here’s my own small announcement: xmlsed is here, or at least a somewhat working first version. It’s in the Git repository, under the master branch (everything’s been merged back into master some time ago).

It’s mostly untested, and the exact interface/behavior is likely to change, but if you’re interested in helping out, please read the man pages included in the distribution (xmlsed(1) and xml_pattern(7) should get you started) and play a bit with some XML data (you can find sample files in tests/data/).

So, what does my xmlsed do? Basically, it’s somewhat like sed(1) except it does only one of the many things sed can do (though it’s probably the most popular feature of sed): it replaces things with other things, where, in our case, a "thing" is one or more nodes.

It is based on the new event-handling mini internal framework and uses the same patterns as xmlgrep (and the same matching code), augmented with the register-binding operator >= which lets you put captured nodes into a named variable, called a register. Then you can put nodes of the form <$REGNAME/> in your replacement templates, and voilà! The register’s contents gets inserted in place of the <$REGNAME/>. Something like xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>' should swap around two consecutive nodes a and b… well, I very much hope it does.

Well, that’s for the good part. Now, there are many things left unresolved in xmlsed, mainly the fact that its behavior is not consistent with xmlgrep. In xmlsed, if you replace a node, it also deletes all of its descendants, whereas when you select a node in xmlgrep (or in a capture in xmlsed), it comes alone, without any of its descendants (except attributes). In fact, you can even do a lot more weird stuffs in selections like selecting and grouping together non directly connected parts of the tree. To sum up: selection is way more powerful than replacement, and by default, they don’t really look alike (and they aren’t quite implemented the same way either…).

I’m thinking about rationalizing all this, along with the syntax. Through the many feature additions (as David and I decided we needed more power), selection patterns have probably become too convoluted, unnecessarily complex and fine-grained where it doesn’t matter. However, we should not give up on all that flexibility and probably bring some over to the replacement mechanism. It’ll also be a good opportunity to revise the syntax (more XPath-isms?) and migrate to some automated parser generator, one that can create reentrant and push-style parsers. I have that big patch of NetBSD yacc doing just that, waiting in my Git repository, but it probably needs more polishing before I can hope to submit it to the mailing lists to get myself flamed for doing things on my own, but oh well, we’ll see.

Now’s not the time for that yet. First, I need to test xmlsed more thoroughly to ensure that the implementation underlying the syntax really works.

Also, other things that I’ll want/need to do once I’m done with xmlsed (to the point where it runs well enough, doing something useful) are, in no particular order:

  • Rewrite the matching code to be more generic, simple, and readable: the idea is to split various constructs into more elementary blocks (I haven’t decided exactly what they would be, but I’m thinking about it) and have some kind of automaton feed on these instructions and move accordingly.

  • Improve the data structures in use, both for efficiency reasons and to make it possible to (theorically) run multiple instances of the matcher on the same stream of data, hopefully opening the way for some more concurrency. This last point is almost possible now, but not quite yet; there are still some matching data that remain in the node itself while they shouldn’t.

  • Implement the amended algorithm I’ve devised that should (if I’m correct…) eliminate the "bad" parts of the complexity that make it theorically inferior to the transducer network algorithm, while taking somewhat more memory (though the asymptotic bounds on memory consumption should stay the same).

  • Do something about locales; also handle integration into base, and packaging for non-NetBSD systems. In short: better integration.

Then of course, there’ll also be work on the remaining tools: some kind of xmljoin is probably next after xmlsed. In which order I address the various points, I don’t know yet; it’ll probably depend on what I feel is more urgent, and what I feel like doing first. :)

[ tag:blog.huoc.org,2009:posts/33 ]
View comments · Comment

/\

xmlsed preview: writing XML

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/25 2009 at 12:48 PM. No comments.
<. More benchmarks: the bitvopt branch
>. xmltools update: new I/O layer and further plans
templates xmlsed xmltools

One might be tempted to think that, while parsing can be a complex task involving various competing models, printing is straightforward and should be the easy part of the job. This is not quite true however, as David and I have been in disagreement for some time already over the exact semantics and syntax of a potential XML printing subsystem of xmltools.

This post is all about the design of the printing language in xmltools and the underlying model. I also present at the end my latest hybrid proposal, that I’ve submitted to David for approval earlier today, and which I hope will bring satisfaction to all parties. If you have some opinion on this topic, do not hesitate to e-mail me! (My e-mail is my first name followed by my last name with all spaces replaced by dots; it’s written in French in the footer so you might not be able to decode it.)

The basic problem can be described as follows: we have a source tree, the input document and we want to transform it by adding or removing nodes in some places. Simple? Not quite.

Select & transform

The first design I proposed is based on sed(1): we select some node in the tree using a tree pattern (same as in my xmlgrep) and then transform it using some actions: insert, append, replace, etc. Only these needed to be adapted to XML.

My idea was then to use PYX/ESIS-inspired actions, which also happen to model closely my internal API. These consisted of a one-character prefix indicating the thing to print, followed by contents (depending on the prefix). Prefixes were:

( opening tag
) closing tag
@ attribute
. text
! comment
? junk

This is a very simple approach but David objected that it was unnatural to write XML using some language which was not XML.

Advantages
  • Very simple and straightforward.
  • Completely symmetrical to the way patterns are matched.
Disadvantages
  • Code used to write XML does not look like the output.
  • Not intuitive to read as data.

Decorate & pull

Another approach, which is used most notably by XSLT, is to have a template, which itself is written as XML, in which some elements from the source document are pulled.

Let me say it upfront in case anybody is not clear on this: pulling is not a stream technique, so this model as is cannot be implemented easily (some subsets of XSLT as stream transformations have been researched, but since it is from the start not a stream-oriented technique, it’s complex and IMHO not worth it).

David first suggested using a template as a way of having printing-heavy script actually look like their output. He observed plain XML could be written directly while some special elements, e.g. xmlsed:addattr could be used to print things that cannot be represented directly in XML. Indeed, there is no XML syntax to write just an attribute without the surrounding tag, yet there are cases where we might want to add an attribute to an existing element.

I criticized this idea based on the fact that reserving a namespace was intrusive and that using XML to represent scripting elements was too verbose.

But the main point was that pull-style transformation was never an option.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
Disadvantages
  • Model is not adapted to streaming.
  • Too verbose.

Towards a hybrid model: XML syntax for select & transform

Given the stream-unfriendly nature of the template-based model, David came up with the idea to represent actions of the select & transform model as XML: XML opening and closing tags could be used to represent the corresponding printing actions (respectively ( and )) but for the rest, we were still stuck with the way-too-verbose XSLT-like <xmlsed:whatever> elements.

I was still very dissatisfied with the verbosity of having to write <xmlsed:whatever/> just to add an attribute or specify where to insert the current node (before, after, inside the plain XML part), but the concept itself, of using a template, was quite nice.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
  • Still reasonably simple to implement.
Disadvantages
  • Still too verbose.

A hybrid model

Today, after more brainstorming, I finally decided on the following model, and unless David is really against it (which I seriously doubt), I think it will make it into xmlsed one way or the other: the idea is the same as that of the previous attempt, use a template which actually describes actions inferred from the structure of XML itself.

Only, this time, we do not use special elements to indicate actions anymore, everything is inferred from the structure. The trick to do this is to have a virtual element represent the current node; it can have any name (it will be user-settable), let’s call it : by default. We can place it wherever we want in the template. And to add an attribute (resp. a child), we just have to add an attribute (resp. a child) to that virtual element. The syntax remains reasonably compact (even optimally compact given we’re using XML as the basis for our template) and the script now very closely models the output, even more than XSLT scripts!

The idea was inspired from xargs(1) and its -I option. Again, Unix saved the day… well, not quite, but still, I think it is a nice and practical compromise. :)

[ tag:blog.huoc.org,2009:posts/10 ]
No comments · Comment