[Atom] [Mail] [Twitter]
Links: git · hacks · misc · cabal · buzz! · about +
Today’s menu
\/

In search of a good pull-style XML parser

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/09 2009 at 11:07 AM. No comments.
>. XML as a tree representation
api parsing xml xmltools

Recently, I’ve been working with expat extensively, as part of my Google Summer of Code project, and have come to the realization that there is, IMHO, no good well-maintained lightweight stream-oriented XML parser.

Of course, one would think expat, at this point, and I thought so too, when I started, only to find myself now locked with an expat-like design, which is both inefficient and unnecessarily complicated (for what I want to do).

So, what is this expat design I am talking about? It is the event-driven (callback-oriented) model, what the XML guys out there call push parsers, though I personally think it doesn’t make much sense to call it that way, so I won’t.

While event-driven parsing is good for some tasks — awk(1) is a really clever tool — it really fails on complex parsing tasks such as those required by my xmltools that rely on partial reprocessing of the document tree.

Reprocessing of the document tree basically means we have to keep an in-memory copy of the parts we want to traverse more than once. Of course, expat can (help us) do that. Using expat, the basic idea is to have two input sources which both call the event handlers. Then, whenever we need to reprocess some part of the tree, we just traverse our in-memory copy and feed the information to the handlers as if they were receiving these from expat.

Now, this seems good at first, but the fact is we now have two parsers: expat and our tree parser. Moreover, they don’t share an identical behavior: on the expat side, we may need to build new nodes to put in the partial tree (for future reprocessing), while in the in-memory tree parser, we may need to free some nodes which are no longer needed. But even in the expat handling code there may be nodes we have to keep (usually the current branch, from the root to the current node); this in turn implies we sometimes have to free some nodes as well. Now we have bits and pieces of the in-memory tree management code everywhere. Things start to look messy…

Besides, when do we call the surrogate parser? The issue here is we don’t want expat to feed new data while we are in the middle of (or more precisely before we can even start) a reprocessing operation. The expat manual points at the XML_StopParser and XML_ResumeParser functions. These seem to be exactly what we need, right? Wrong. Indeed, a more careful reading of the documentation reveals parsing does not stop immediately after the handler that called XML_StopParser; there may be an unspecified number of additional events between the end of that one and the moment the parser actually stops. This makes these routines pretty much useless. Actually, we need to use the secondary input system right on top of expat itself, in the middle of our callbacks, whenever we need this functionality. This is not pretty to say the least…

But there is a more elegant solution: the idea is to have the internal tree management module drive the process. It maintains the in-memory tree and calls the event code on its nodes whenever necessary, possibly going through loops or any patterns that suit its needs, for that matter. Then, when it requires a node that is not present in the tree, it calls upon the input module which returns the next node in the stream. This is the so-called pull method. There really is nothing special about it; in fact, it’s the way most I/O libraries out there are built. Why XML had to be different, I don’t know.

As I’ve said, to the extent of my knowledge, there is no lightweight and well-maintained (I mean not some two-year-old experiment by some obscure hacker, though I respect these things, I don’t have time right now to play with such codes) C library which implements the full XML standard and this pull metaphor. If anyone out there knows, do not hesitate to let me know.

[ tag:blog.huoc.org,2009:posts/3 ]
No comments · Comment

/\ \/

XML as a tree representation

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/10 2009 at 03:33 PM. No comments.
<. In search of a good pull-style XML parser
>. xmlgrep toy benchmarks
trees xml xmltools

xmltools is a project whose ultimate goal is to bring Unix command-line efficiency to the XML world (and not, as many might think, to bring XML into the Unix world, which is being done already, though not by my fault). The idea is to treat XML as a generic tree representation format.

This seems very straightforward; in fact, one could argue that XML was designed from the start to be a generic tree representation. However, XML has a number of problems that make it hard to use in this role.

The first such problems is doctype declarations. While the bare XML structure, which we see most of the time, is very simple, doctype declarations and their semantics are both unneeded and unwanted in tree (syntax) manipulation programs. It would be fine if doctype support was optional for non-validating XML processors, but as I understand it, the XML standard mandates support for a portion of it: entity declarations and references, and default attribute values.

The second issue is with white space handling: XML does not include any default behavior regarding white space. XML parsers must pass all white space to the application, which is responsible for some sensible default processing. The xml:space attribute can give a hint to the application but as far as I know, it’s not in wide use.

This all led me to develop a set of compatibility rules which precisely describe what one should expect from xmltools, no matter which backend is used (in case we move away from expat, which I plan to do, eventually). This subset is sufficient to describe any tree-like data set.

The rules will be maintained in the doc/compatxml.text file, in the xmltools source tree. For convenience, I’ve reproduced the current version below:

  1. Do not use doctype declarations for any purpose other than validation. In particular, do not rely on doctype declarations to provide default attributes and do not use internal entities for arbitrary substitutions; if you wanted to save typing, you wouldn’t be using XML. Use of numeric and predefined entities is OK. Use of directly-encoded characters is preferred.

  2. Do not presuppose the XML processor has access to any resource beside your file: do not use external entities.

  3. Do not assume the XML processor is able to handle multiple character sets: all the documents and other data supplied to the program should use the same character set. Also, set your locale appropriately.

  4. Unless xml:space is set, assume all white-space only text elements are ignored by the XML processor.

  5. Set xml:space to preserve whenever white space not kept by the previous rule must be preserved. As stated in the first rule, do not rely on doctype declarations to implicitly set this attribute. Please note however that the fourth rule works well for most purposes, including processing of HTML pre elements which only contain verbatim text.

[ tag:blog.huoc.org,2009:posts/5 ]
No comments · Comment

/\ \/

xmlgrep by examples: playing with Atom

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Updated on 06/22 2009 at 02:30 PM. No comments.
<. Improving performances of xmlgrep
>. More benchmarks: the bitvopt branch
atom tutorial xml xmltools

Edited 22.06. Incompatible syntax changes and more examples.

Since Matthew Sporleder on tech-userlevel has (implicitly) suggested that xmlgrep could be used to dissect Atom feeds, but was a bit lost at how to do it exactly, I thought I’d post a little demo of a console session playing with an Atom file and xmlgrep (as well as some other command-line tools).

First, let’s fetch the feed:

$ ftp http://mspo.com/blog/atom.xml
[...]
As a side effect of xmlgrep, we might want to indent the XML file to make it human-readable:
$ xmlgrep '*' atom.xml | more
List all posts in the NetBSD category with their IDs:
$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]

Those of you yet unfamiliar with the syntax might have some trouble understanding. The previous pattern could be read "select a text child of an id or title element, itself a child of an entry element, which contains a category element which has an attribute child term equal to NetBSD." Step by step, you should notice that a[b] is read "a such that b", | stands for "or", / for "child of", @ for "attribute", . for "text", and the braces are used for grouping purposes.

Now, let’s select a post by ID:

$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml
Or, select a post by title and view its contents using w3m:
$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html

As a side note, I should mention that up to now we have used subpatterns quite a lot. This is because the Atom feed specification does not force an order (or does it?) on the children of entry elements. With more precise knowledge of the order of elements relative to each other, we could have optimized the pattern to use % and %% where possible. Subpatterns are costly, but for data sets this size, we probably don’t care much.

Let’s print all entry titles which date from March 2009 using the fact that we know the updated element comes before the title one:

$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml
A friend of mine told me it would be useful to have arithmetic predicates. I think they will feature in xmltools sooner or later, but even without them, it is still possible to do some simple statistics, by combining the results with awk(1), for example. The following one-liner counts the number of posts that have no older than March:
$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l

That’s it; I hope this will help people who want to get started with xmlgrep. If you have other good examples you’d want me to elaborate, do not hesitate to send me a mail!

[ tag:blog.huoc.org,2009:posts/8 ]
No comments · Comment

/\

xmltools update: new I/O layer and further plans

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 07/08 2009 at 03:45 PM. No comments.
<. xmlsed preview: writing XML
>. xmltools implementation: automata and backtracking
io xml xmltools

It has been nearly two weeks since my last update on my plans for xmlsed. Since that time, David and I have come to an agreement on the (hypothetical) syntax and my hybrid model has been retained.

Rewriting the I/O module

Although I now have a fair idea of what xmlsed should look like and how to implement it, some elements introduced after the recent discussions with my mentor required changes to the existing code.

In particular, the new template syntax required an XML parser able to handle partial XML documents. Besides, I wanted some syntactic sugar on top of XML to make writing templates easier. Expat being a well-behaved XML parser is quite strict about the syntax and there is no easy way to work around that. Though I did spend some days trying, at first, I eventually gave up, as it proved hard to write, let alone maintain in the future.

At the same time, I began to realize the original I/O abstraction I designed was showing its limits. I initially wanted to be able to read and write multiple tree representation formats, but in the end, the need to fully support XML, with all its specifics (doctypes, PIs, etc.), has convinced me to drop the idea and focus on XML alone.

All these reasons led me to rewrite the I/O layer (almost) completely. This work is being committed to yet another temporary branch: xmlpush.

Maybe the most visible change is that multi-backend support was dropped and the I/O module now consists of only two drivers: the Expat-based parser (the strict parser) and a home-made loose parser.

All tools now support two modes: the strict mode (-s flag) and the default loose mode.

  • In the strict mode, compliance with the XML standard is a priority: all extensions are disabled, including multi-root and partial documents (which was implemented with Expat as an ugly kludge before, so it was removed), the XML prolog is honored (the specified encoding is used and entity declarations are parsed), and stricter rules are enforced (e.g. in names).

  • In the loose mode, extensions are supported and the XML prolog is ignored (instead, we use the system locale).

The encoding issue

Although I’ve just stated how encodings will be handled, at the moment, I have not written any code to suppor that yet. Actually, it’s not a simple matter.

First of all, the XML standard mandates support of Unicode. For one thing, XML character entities (&xxxx;) are references to character codes from the Unicode tables. This makes support for Unicode pretty much mandatory.

In order to handle this, the Expat people have chosen to serve all data as UTF-8 (or UTF-16, depending on the build-time configuration), no matter what input encoding is used. This should have given you a hint: Expat has its own encoding conversion engine.

Now, even though XML documents can specify an encoding, and so it would be alright to always use UTF-8, that is not an acceptable solution in the real world: Unix users expect their programs to comply to the current locale. But Expat does not care one way or the other about, or integrate with, the locale system.

Fortunately, NetBSD has iconv(3) (though neither FreeBSD nor OpenBSD does, apparently, which will be a problem in making my programs portable; there seems to be an ongoing GSoC effort to port NetBSD libiconv to FreeBSD though), so I plan to run everything through that on input and again on output. Sounds like a waste of resources? Well, don’t blame me.

However, that’s not all: there is also the loose parser. Since this one was written by me, and I had no reason to force UTF-8 everywhere, it does no conversion and assumes all sources are in the current locale. The only problem will be for character entities (not currently implemented), but I intend to localize the Unicode-to-locale transformation to these, since it is costly.

Remember the push-style vs event-driven parser debate?

Well, it wasn’t really a debate, but maybe some of you remember that I posted some weeks ago a ticket on this blog about how I grew dissatisfied with the rigid event-driven behavior of Expat and wanted a push-style parser.

I took the occasion this time to make my wish come true (almost) and the new I/O design is mostly push-style. I say mostly because I’ve made some compromises in order not to impair performances.

At first, I thought about having the parser run on a fragment of input text and build a queue of events to be fed to the application one by one, on a on-demand basis. But then I realized I could just use the internal tree directly as my event queue, and have the application read that. Since we have support for look-ahead in the matching engine (and hence need to keep a partial internal tree in any case) and most tools will use that, this effectively moved the tree building code from the matcher to the I/O module, at the same time eliminating direct event responses (a bit less than 500 lines of code). This last point needs some clarification: sure we do use a little more memory (but honestly that just doesn’t show) since we build the tree first and only then process it, but this is limited to how many nodes can be represented in one read buffer, which means it’s mostly insignificant. But we have gained what I think is far more important: an unified implementation of the matcher (which is the most sensitive part).

Further plans

At the moment, the new parser does not support things such as entity declarations and I am still pondering whether to write code for that or not. In any case, it would be a good idea to have a xmlcat(1) utility which fully supports XML, including external entities, with the ability to fetch and include external documents (fetch(3)?) and substitute references. But let’s leave this discussion for another time.

As for the locale support, I think it will come after xmlsed is somewhat ready (i.e. as xmlgrep is right now). So for now, some more testing needs to be done on the new I/O components, and when this is done, I will resume work on xmlsed proper.

[ tag:blog.huoc.org,2009:posts/15 ]
No comments · Comment