xmlsed prototype
By rz0. Posted on 10/12 2009 at 10:30 PM.
Following David Young’s post on the official NetBSD blog, here’s my own small announcement: xmlsed is here, or at least a somewhat working first version. It’s in the Git repository, on the master branch (everything was merged back into master some time ago).
It’s mostly untested, and the exact interface/behavior is likely to
change, but if you’re interested in helping out, please read the man
pages included in the distribution (xmlsed(1) and
xml_pattern(7) should get you started) and play a bit with some
XML data (you can find sample files in tests/data/).
So, what does my xmlsed do? Basically, it’s somewhat like sed(1) except it does only one of the many things sed can do (though it’s probably the most popular feature of sed): it replaces things with other things, where, in our case, a "thing" is one or more nodes.
It is based on the new event-handling mini internal framework and uses
the same patterns as xmlgrep (and the same matching code), augmented
with the register-binding operator >= which lets you put captured
nodes into a named variable, called a register. Then you can put nodes
of the form <$REGNAME/> in your replacement templates, and voilà!
The register’s contents get inserted in place of the
<$REGNAME/>. Something like xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>'
should swap around two consecutive nodes a and b… well, I very
much hope it does.
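To make the intended effect concrete, here is a minimal Python sketch of the "capture into registers, then fill a template" idea. It is not xmlsed's code and does not parse xmlsed patterns: it simply assumes the example above means "an a element immediately followed by a sibling b element", and the function name swap_consecutive and the registers dictionary are made up for illustration.

```python
import xml.etree.ElementTree as ET


def swap_consecutive(parent, first="a", second="b"):
    """Swap every <first/> that is immediately followed by a <second/> sibling."""
    children = list(parent)
    i = 0
    while i < len(children) - 1:
        if children[i].tag == first and children[i + 1].tag == second:
            # "Registers": named slots holding the captured nodes
            # (hypothetical stand-in for the >= binding operator).
            registers = {"a": children[i], "b": children[i + 1]}
            # Template <$b/><$a/>: emit register b, then register a.
            children[i:i + 2] = [registers["b"], registers["a"]]
            i += 2
        else:
            i += 1
    parent[:] = children
    # Apply the same rewrite further down the tree.
    for child in children:
        swap_consecutive(child, first, second)
    return parent


doc = ET.fromstring('<r><a id="1"/><b id="2"/><c/></r>')
swap_consecutive(doc)
tags_after_swap = [child.tag for child in doc]
print(ET.tostring(doc, encoding="unicode"))
```

After the call, the children of r appear in the order b, a, c, and each element keeps its own attributes.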
Well, so much for the good part. Now, there are many things left unresolved in xmlsed, mainly the fact that its behavior is not consistent with xmlgrep. In xmlsed, if you replace a node, it also deletes all of its descendants, whereas when you select a node in xmlgrep (or in a capture in xmlsed), it comes alone, without any of its descendants (except attributes). In fact, you can do a lot more weird stuff in selections, like selecting and grouping together parts of the tree that are not directly connected. To sum up: selection is way more powerful than replacement, and by default, they don’t really look alike (and they aren’t quite implemented the same way either…).
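The mismatch is easier to see side by side. The following Python sketch only mimics the two behaviors as described in this post; it is not taken from xmlsed or xmlgrep, and select_shallow and replace_subtree are hypothetical names.

```python
import xml.etree.ElementTree as ET


def select_shallow(node):
    """Selection as described: the node and its attributes, no descendants."""
    return ET.Element(node.tag, dict(node.attrib))


def replace_subtree(parent, target_tag, replacement_tag):
    """Replacement as described: the matched node goes, descendants included."""
    for index, child in enumerate(list(parent)):
        if child.tag == target_tag:
            # Overwriting the slot drops the whole subtree rooted at child.
            parent[index] = ET.Element(replacement_tag)
        else:
            replace_subtree(child, target_tag, replacement_tag)


doc = ET.fromstring('<r><a id="1"><x/><y/></a></r>')

selected = select_shallow(doc[0])   # <a id="1"> alone, without x and y
replace_subtree(doc, "a", "z")      # <a> is gone together with x and y
remaining_tags = [element.tag for element in doc.iter()]
```

Selecting a leaves x and y behind in the document, while replacing a takes them away with it, which is exactly the asymmetry that needs rationalizing.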
I’m thinking about rationalizing all this, along with the syntax. Through the many feature additions (as David and I decided we needed more power), selection patterns have probably become too convoluted: unnecessarily complex, and fine-grained where it doesn’t matter. However, we should not give up on all that flexibility, and should probably bring some of it over to the replacement mechanism. It’ll also be a good opportunity to revise the syntax (more XPath-isms?) and migrate to some automated parser generator, one that can create reentrant and push-style parsers. I have a big patch to NetBSD yacc that does just that, waiting in my Git repository. It probably needs more polishing before I can hope to submit it to the mailing lists (and get myself flamed for doing things on my own), but oh well, we’ll see.
Now’s not the time for that yet. First, I need to test xmlsed more thoroughly to ensure that the implementation underlying the syntax really works.
Also, other things that I’ll want/need to do once I’m done with xmlsed (to the point where it runs well enough, doing something useful) are, in no particular order:
Rewrite the matching code to be more generic, simple, and readable: the idea is to split various constructs into more elementary blocks (I haven’t decided exactly what they would be, but I’m thinking about it) and have some kind of automaton feed on these instructions and move accordingly.
Improve the data structures in use, both for efficiency reasons and to make it possible to (theoretically) run multiple instances of the matcher on the same stream of data, hopefully opening the way for some more concurrency. This last point is almost possible now, but not quite yet; some matching data still lives in the node itself when it shouldn’t.
Implement the amended algorithm I’ve devised that should (if I’m correct…) eliminate the "bad" parts of the complexity that make it theoretically inferior to the transducer network algorithm, while taking somewhat more memory (though the asymptotic bounds on memory consumption should stay the same).
Do something about locales; also handle integration into base, and packaging for non-NetBSD systems. In short: better integration.
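As a rough illustration of the first two points (an automaton driven by simple steps, with all matching state kept out of the nodes), here is a small Python sketch. It is a toy, not the actual matcher and not the amended algorithm: the PathMatcher class, the plain list of element names used as "instructions", and the (event, name) tuples are all assumptions made for the example.

```python
class PathMatcher:
    """Toy automaton matching an absolute path of element names.

    All matching state lives in the matcher instance, never in the nodes,
    so several matchers can consume the same event stream independently.
    """

    def __init__(self, steps):
        self.steps = list(steps)   # elementary "instructions": one name per depth
        self.stack = []            # per open element: does the path match so far?
        self.matches = 0

    def feed(self, event, name):
        if event == "start":
            depth = len(self.stack)
            parent_ok = self.stack[-1] if self.stack else True
            ok = (parent_ok
                  and depth < len(self.steps)
                  and self.steps[depth] == name)
            self.stack.append(ok)
            if ok and depth == len(self.steps) - 1:
                self.matches += 1
        elif event == "end":
            self.stack.pop()


# One stream of push-style events, shared by two independent matchers.
events = [
    ("start", "r"),
    ("start", "a"), ("end", "a"),
    ("start", "b"), ("start", "a"), ("end", "a"), ("end", "b"),
    ("start", "a"), ("end", "a"),
    ("end", "r"),
]

direct = PathMatcher(["r", "a"])
nested = PathMatcher(["r", "b", "a"])
for event, name in events:
    for matcher in (direct, nested):
        matcher.feed(event, name)

print(direct.matches, nested.matches)
```

Because neither matcher writes anything into the events or the nodes they describe, adding a third matcher (or running them in separate threads over a copy of the stream) would not require touching the other two.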
Then of course, there’ll also be work on the remaining tools: some kind of xmljoin is probably next after xmlsed. In which order I address the various points, I don’t know yet; it’ll probably depend on what I feel is more urgent, and what I feel like doing first. :)