[Talk] [Atom] [Talk Atom]    [Git] [Mail] [Reload]
Links: mdown · hacks · misc · cabal · about +
Today’s menu
\/

xmlsed prototype

#. By rz0. Posted on 10/12 2009 at 10:30 PM. 5 comments.
api xmlsed xmltools

Following David Young’s post on the official NetBSD blog, here’s my own small announcement: xmlsed is here, or at least a somewhat working first version. It’s in the Git repository, under the master branch (everything’s been merged back into master some time ago).

It’s mostly untested, and the exact interface/behavior is likely to change, but if you’re interested in helping out, please read the man pages included in the distribution (xmlsed(1) and xml_pattern(7) should get you started) and play a bit with some XML data (you can find sample files in tests/data/).

So, what does my xmlsed do? Basically, it’s somewhat like sed(1) except it does only one of the many things sed can do (though it’s probably the most popular feature of sed): it replaces things with other things, where, in our case, a "thing" is one or more nodes.

It is based on the new event-handling mini internal framework and uses the same patterns as xmlgrep (and the same matching code), augmented with the register-binding operator >= which lets you put captured nodes into a named variable, called a register. Then you can put nodes of the form <$REGNAME/> in your replacement templates, and voilà! The register’s contents gets inserted in place of the <$REGNAME/>. Something like xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>' should swap around two consecutive nodes a and b… well, I very much hope it does.

Well, that’s for the good part. Now, there are many things left unresolved in xmlsed, mainly the fact that its behavior is not consistent with xmlgrep. In xmlsed, if you replace a node, it also deletes all of its descendants, whereas when you select a node in xmlgrep (or in a capture in xmlsed), it comes alone, without any of its descendants (except attributes). In fact, you can even do a lot more weird stuffs in selections like selecting and grouping together non directly connected parts of the tree. To sum up: selection is way more powerful than replacement, and by default, they don’t really look alike (and they aren’t quite implemented the same way either…).

I’m thinking about rationalizing all this, along with the syntax. Through the many feature additions (as David and I decided we needed more power), selection patterns have probably become too convoluted, unnecessarily complex and fine-grained where it doesn’t matter. However, we should not give up on all that flexibility and probably bring some over to the replacement mechanism. It’ll also be a good opportunity to revise the syntax (more XPath-isms?) and migrate to some automated parser generator, one that can create reentrant and push-style parsers. I have that big patch of NetBSD yacc doing just that, waiting in my Git repository, but it probably needs more polishing before I can hope to submit it to the mailing lists to get myself flamed for doing things on my own, but oh well, we’ll see.

Now’s not the time for that yet. First, I need to test xmlsed more thoroughly to ensure that the implementation underlying the syntax really works.

Also, other things that I’ll want/need to do once I’m done with xmlsed (to the point where it runs well enough, doing something useful) are, in no particular order:

  • Rewrite the matching code to be more generic, simple, and readable: the idea is to split various constructs into more elementary blocks (I haven’t decided exactly what they would be, but I’m thinking about it) and have some kind of automaton feed on these instructions and move accordingly.

  • Improve the data structures in use, both for efficiency reasons and to make it possible to (theorically) run multiple instances of the matcher on the same stream of data, hopefully opening the way for some more concurrency. This last point is almost possible now, but not quite yet; there are still some matching data that remain in the node itself while they shouldn’t.

  • Implement the amended algorithm I’ve devised that should (if I’m correct…) eliminate the "bad" parts of the complexity that make it theorically inferior to the transducer network algorithm, while taking somewhat more memory (though the asymptotic bounds on memory consumption should stay the same).

  • Do something about locales; also handle integration into base, and packaging for non-NetBSD systems. In short: better integration.

Then of course, there’ll also be work on the remaining tools: some kind of xmljoin is probably next after xmlsed. In which order I address the various points, I don’t know yet; it’ll probably depend on what I feel is more urgent, and what I feel like doing first. :)

[ tag:blog.huoc.org,2009:posts/33 ]
View comments · Comment

/\ \/

xmltools: after the GSoC

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 09/07 2009 at 05:14 PM. No comments.
<. xmltools: what's done, what's left
api atf xmltools

Since the end of the Summer of Code, I have been busy doing various miscellaneous tasks around my xmltools, so that it be ready for inclusion into the NetBSD tree some time in the future. This is a short summary of what has happened (just so people know that my project’s not dead, I’ve not vanished, or something along these lines); please take a look at the code repo for details:

ATF test suites

I’ve finally decided to convert my bunch of shell scripts to ATF tests. These are basic tests which check the various constructs of the pattern language. There are 46 tests, three of which are optional (they are made against rather big data sets that I do not distribute with the sources). These include a test suite centered around property list matching. If you have relevant test cases and would like to see them included, please send them over.

Library API

Some people had been asking me if there would be a library, well, yes, and I’ve libified all the existing code in anticipation of that, and also because I wanted to eventually commit somthing clean to the NetBSD tree. All routines have been made to support proper error reporting, and expose clean interfaces. The library is divided into four main sub-namespaces:α

  • hhxml_chunk_* : XML in-memory tree manipulations.
  • hhxml_stream_* : XML streams and (push-style) parsers. The streams have a hybrid design: they can be used to read nodes one at a time in a pure stream fashion or to generate partial in-memory trees. See misc/atomheadlines.c in the source distribution for an example.
  • hhxml_path_* : patterns and matching. This one is incomplete and the old match.c will need to be converted to fit into the new API.
  • hhxml_pprint_* : XML pretty printer. The implementation is also mostly incomplete at this time: it only supports printing to stdout, well, because the command-line tools don’t require any more than that.

The various parts of the API will be completed as I go about writing the tools themselves, but I wanted to have clean well-defined interfaces to build onto.

α: By the way, "hhxml" stands for "(the) very short XML (library)".

Non-visible changes to the pattern matching engine and plans

Well, I’ll write more on this one when I get the time, but I’ve been looking into other implementations and theories, especially SPEX, which boasts algorithms with good complexities (better than what is currently implemented in my tools). I have plans for improvements to the engine which I believe, if I’m not wrong, would bring us similar performance, but that will have to wait, since it is not critically important, at the moment, in my opinion.

That’s all for today!

[ tag:blog.huoc.org,2009:posts/28 ]
No comments · Comment

/\

In search of a good pull-style XML parser

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/09 2009 at 11:07 AM. No comments.
>. XML as a tree representation
api parsing xml xmltools

Recently, I’ve been working with expat extensively, as part of my Google Summer of Code project, and have come to the realization that there is, IMHO, no good well-maintained lightweight stream-oriented XML parser.

Of course, one would think expat, at this point, and I thought so too, when I started, only to find myself now locked with an expat-like design, which is both inefficient and unnecessarily complicated (for what I want to do).

So, what is this expat design I am talking about? It is the event-driven (callback-oriented) model, what the XML guys out there call push parsers, though I personally think it doesn’t make much sense to call it that way, so I won’t.

While event-driven parsing is good for some tasks — awk(1) is a really clever tool — it really fails on complex parsing tasks such as those required by my xmltools that rely on partial reprocessing of the document tree.

Reprocessing of the document tree basically means we have to keep an in-memory copy of the parts we want to traverse more than once. Of course, expat can (help us) do that. Using expat, the basic idea is to have two input sources which both call the event handlers. Then, whenever we need to reprocess some part of the tree, we just traverse our in-memory copy and feed the information to the handlers as if they were receiving these from expat.

Now, this seems good at first, but the fact is we now have two parsers: expat and our tree parser. Moreover, they don’t share an identical behavior: on the expat side, we may need to build new nodes to put in the partial tree (for future reprocessing), while in the in-memory tree parser, we may need to free some nodes which are no longer needed. But even in the expat handling code there may be nodes we have to keep (usually the current branch, from the root to the current node); this in turn implies we sometimes have to free some nodes as well. Now we have bits and pieces of the in-memory tree management code everywhere. Things start to look messy…

Besides, when do we call the surrogate parser? The issue here is we don’t want expat to feed new data while we are in the middle of (or more precisely before we can even start) a reprocessing operation. The expat manual points at the XML_StopParser and XML_ResumeParser functions. These seem to be exactly what we need, right? Wrong. Indeed, a more careful reading of the documentation reveals parsing does not stop immediately after the handler that called XML_StopParser; there may be an unspecified number of additional events between the end of that one and the moment the parser actually stops. This makes these routines pretty much useless. Actually, we need to use the secondary input system right on top of expat itself, in the middle of our callbacks, whenever we need this functionality. This is not pretty to say the least…

But there is a more elegant solution: the idea is to have the internal tree management module drive the process. It maintains the in-memory tree and calls the event code on its nodes whenever necessary, possibly going through loops or any patterns that suit its needs, for that matter. Then, when it requires a node that is not present in the tree, it calls upon the input module which returns the next node in the stream. This is the so-called pull method. There really is nothing special about it; in fact, it’s the way most I/O libraries out there are built. Why XML had to be different, I don’t know.

As I’ve said, to the extent of my knowledge, there is no lightweight and well-maintained (I mean not some two-year-old experiment by some obscure hacker, though I respect these things, I don’t have time right now to play with such codes) C library which implements the full XML standard and this pull metaphor. If anyone out there knows, do not hesitate to let me know.

[ tag:blog.huoc.org,2009:posts/3 ]
No comments · Comment