Talk   Atom Mail Git
Links: mdown · hacks · misc · cabal · about +

Ours & Hippy — le blog [xmltools]

Today’s menu · · ·

xmlsed prototype

#. By Nhat Minh Lê. Posted on 10/12 2009 at 10:30 PM. 5 comments.

Following David Young’s post on the official NetBSD blog, here’s my own small announcement: xmlsed is here, or at least a somewhat working first version. It’s in the Git repository, under the master branch (everything’s been merged back into master some time ago).

It’s mostly untested, and the exact interface/behavior is likely to change, but if you’re interested in helping out, please read the man pages included in the distribution (xmlsed(1) and xml_pattern(7) should get you started) and play a bit with some XML data (you can find sample files in tests/data/).

So, what does my xmlsed do? Basically, it’s somewhat like sed(1) except it does only one of the many things sed can do (though it’s probably the most popular feature of sed): it replaces things with other things, where, in our case, a "thing" is one or more nodes.

It is based on the new event-handling mini internal framework and uses the same patterns as xmlgrep (and the same matching code), augmented with the register-binding operator >= which lets you put captured nodes into a named variable, called a register. Then you can put nodes of the form <$REGNAME/> in your replacement templates, and voilà! The register’s contents gets inserted in place of the <$REGNAME/>. Something like xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>' should swap around two consecutive nodes a and b… well, I very much hope it does.

Well, that’s for the good part. Now, there are many things left unresolved in xmlsed, mainly the fact that its behavior is not consistent with xmlgrep. In xmlsed, if you replace a node, it also deletes all of its descendants, whereas when you select a node in xmlgrep (or in a capture in xmlsed), it comes alone, without any of its descendants (except attributes). In fact, you can even do a lot more weird stuffs in selections like selecting and grouping together non directly connected parts of the tree. To sum up: selection is way more powerful than replacement, and by default, they don’t really look alike (and they aren’t quite implemented the same way either…).

I’m thinking about rationalizing all this, along with the syntax. Through the many feature additions (as David and I decided we needed more power), selection patterns have probably become too convoluted, unnecessarily complex and fine-grained where it doesn’t matter. However, we should not give up on all that flexibility and probably bring some over to the replacement mechanism. It’ll also be a good opportunity to revise the syntax (more XPath-isms?) and migrate to some automated parser generator, one that can create reentrant and push-style parsers. I have that big patch of NetBSD yacc doing just that, waiting in my Git repository, but it probably needs more polishing before I can hope to submit it to the mailing lists to get myself flamed for doing things on my own, but oh well, we’ll see.

Now’s not the time for that yet. First, I need to test xmlsed more thoroughly to ensure that the implementation underlying the syntax really works.

Also, other things that I’ll want/need to do once I’m done with xmlsed (to the point where it runs well enough, doing something useful) are, in no particular order:

  • Rewrite the matching code to be more generic, simple, and readable: the idea is to split various constructs into more elementary blocks (I haven’t decided exactly what they would be, but I’m thinking about it) and have some kind of automaton feed on these instructions and move accordingly.

  • Improve the data structures in use, both for efficiency reasons and to make it possible to (theorically) run multiple instances of the matcher on the same stream of data, hopefully opening the way for some more concurrency. This last point is almost possible now, but not quite yet; there are still some matching data that remain in the node itself while they shouldn’t.

  • Implement the amended algorithm I’ve devised that should (if I’m correct…) eliminate the "bad" parts of the complexity that make it theorically inferior to the transducer network algorithm, while taking somewhat more memory (though the asymptotic bounds on memory consumption should stay the same).

  • Do something about locales; also handle integration into base, and packaging for non-NetBSD systems. In short: better integration.

Then of course, there’ll also be work on the remaining tools: some kind of xmljoin is probably next after xmlsed. In which order I address the various points, I don’t know yet; it’ll probably depend on what I feel is more urgent, and what I feel like doing first. :)

[ tag:blog.huoc.org,2009:posts/33 ]
View comments · Comment

xmltools: after the GSoC

#. By Nhat Minh Lê. Posted on 09/07 2009 at 05:14 PM. No comments.

Since the end of the Summer of Code, I have been busy doing various miscellaneous tasks around my xmltools, so that it be ready for inclusion into the NetBSD tree some time in the future. This is a short summary of what has happened (just so people know that my project’s not dead, I’ve not vanished, or something along these lines); please take a look at the code repo for details:

ATF test suites

I’ve finally decided to convert my bunch of shell scripts to ATF tests. These are basic tests which check the various constructs of the pattern language. There are 46 tests, three of which are optional (they are made against rather big data sets that I do not distribute with the sources). These include a test suite centered around property list matching. If you have relevant test cases and would like to see them included, please send them over.

Library API

Some people had been asking me if there would be a library, well, yes, and I’ve libified all the existing code in anticipation of that, and also because I wanted to eventually commit somthing clean to the NetBSD tree. All routines have been made to support proper error reporting, and expose clean interfaces. The library is divided into four main sub-namespaces:α

  • hhxml_chunk_* : XML in-memory tree manipulations.
  • hhxml_stream_* : XML streams and (push-style) parsers. The streams have a hybrid design: they can be used to read nodes one at a time in a pure stream fashion or to generate partial in-memory trees. See misc/atomheadlines.c in the source distribution for an example.
  • hhxml_path_* : patterns and matching. This one is incomplete and the old match.c will need to be converted to fit into the new API.
  • hhxml_pprint_* : XML pretty printer. The implementation is also mostly incomplete at this time: it only supports printing to stdout, well, because the command-line tools don’t require any more than that.

The various parts of the API will be completed as I go about writing the tools themselves, but I wanted to have clean well-defined interfaces to build onto.

α: By the way, "hhxml" stands for "(the) very short XML (library)".

Non-visible changes to the pattern matching engine and plans

Well, I’ll write more on this one when I get the time, but I’ve been looking into other implementations and theories, especially SPEX, which boasts algorithms with good complexities (better than what is currently implemented in my tools). I have plans for improvements to the engine which I believe, if I’m not wrong, would bring us similar performance, but that will have to wait, since it is not critically important, at the moment, in my opinion.

That’s all for today!

[ tag:blog.huoc.org,2009:posts/28 ]
No comments · Comment

xmltools: what's done, what's left

#. By Nhat Minh Lê. Posted on 08/13 2009 at 11:44 PM. No comments.

So the Summer of Code ends soon and it’s time to report what has been done, and what is still lacking. Obviously, everybody can see there is no xmlsed yet in the repository, so what have I been doing since I completed xmlgrep one month ago? What went wrong?

Well, if you ask me, nothing really went wrong, I have just deviated somewhat from my original goals and taken the time to do some things that weren’t part of the initial plan. And even though I am disappointed not to have finished, this has brought its share of good additions. Most of these modifications stem from discussions with David (Young) about features we’d like to see in my xmltools. Also in all these three months, I’ve spent about three weeks optimizing the matcher code.

What we have

Let’s talk about the good stuff first. Of course, the visible part of my work is the existence of xmlgrep, but most of the code actually lives in a common static library. It provides:

  • definitions for patterns and trees;
  • utilities: (tree) buffers, bit vectors, symbol look-up, state and transition caching;
  • a string to pattern parser (that I often just call "pattern compiler");
  • an XML parser (which can handle fragments and multi-roots documents, or work as an interface to Expat, for strict XML parsing);
  • an XML pretty printer;
  • the code for the matcher itself;
  • and a higher-level API used to build the tools themselves.

The library has managed to remain reasonably small, with less than 5k lines of code, counting comments and everything, and the core matching algorithm fits in less than 1.5k lines or so (compare this to the 15k-line xpath.c from libxml2). It is stream-oriented and care has been taken so it eats as little memory as possible while remaining efficient.

The pattern matching code

The matcher is undeniably the most important and sensible piece of code in there. It is a blend of a kind-of-NFA and a backtracking machine, which processes one node at the time (NFA) but is able to do unlimited look-ahead (backtracking). Recent work has involved adding new advanced constructs to the matcher (besides forward and look-ahead mechanisms, which were already implemented) and improving its performances when look-ahead is in use.

Performances

On the performance side, the main addition is that of a state cache and pre-computed states created by the pattern compiler (instead of the matching code itself).

The state cache (implemented as a hash table) operates at two levels :

  • propagation (from parent to child and sibling to sibling);
  • and filtering (deciding whether based on local properties, states match or not).

One effect of using a cache is that bit vectors need not be pooled anymore and a lot of copying has disappeared in favor of pointer swapping. Also, relying on states pre-computed by the pattern compiler has shifted the focus from looping bit by bit through bit vectors to whole-vector operations, most of which can be auto-vectorized by a sufficiently smart compiler (ICC on Linux can vectorize a dozen of these in the whole xmlgrep code, but as I’ve pointed before, GCC auto-vectorizer seems to lag behind its non-use, at least on my code and setup).

With these optimizations merged in, xmlgrep now outperforms xmlstarlet both in terms of speed and memory consumption on my silly little benchmark (for those who have not been there, it consists in grepping for a word and its definition in a 50M XML-formatted English dictionary, and yes, it’s simple-minded), even when using suboptimal look-ahead subpatterns.

Indeed, knowledge of the XML format often enables us to write optimized patterns by anchoring a more costly part with a basic predicate. In our case, instead of searching for p[hw/.="somestring"] (the suboptimal case), we could search for body/p[hw/.="somestring"] since p can only occur inside body. This is an improvement since the p[] part, being a look-ahead subpattern, is sensibly slower than a forward pattern, so reducing the number of tentative matches is essential.

Compared to benchmarks done at the time of the release of xmlgrep, handling of this suboptimal case is about three times faster now.

Features

On the feature side, which is probably more interesting for casual would-be users, I have added grouping, Kleene-like operators, and subpattern negation, as well as reworked some of the old syntax.

Subpattern negation is the simplest extension, algthough it’s fairly powerful. It can be achieved because of backtracking so it’s only available on (look-ahead) subpatterns, not groups (see below). Negation lets you write something like {a}!{a/b} which selects all elements a that do not have a b element. This can’t be done with regular patterns because a node can have many children, and thus, unlike with regexes, something like a/!b (would-be syntax) would match some a node with a child which is not a b, but there can be many other children.

For example:

$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist

will extract only keys from unnamed dictionaries.

The second addition is Kleene-like operators. Well, sorry about the name, but it’s just what comes to mind when I use these. They are, as the name implies, somewhat like their regex counter part, the Kleene star, in that they match zero or more repetitions of the preceding predicate. E.g. a*%b matches <a/><a/><a/>...<b/> but not, say, <a/><c/><d/>...<b/>. The %% and // operators have now become special cases of the Kleene operators. a%%b can be written a%**%b and a//b is really a/**/b, where the first star stands for "any node" and the second is part of the operator.

But having these repetition operators limited to repeating a single node predicate would have been quite boring, and so groups were born. Groups, written inside parentheses () (notice these have replaced the old subpattern syntax which has been moved back to being {} from its first days), well, group a pattern fragment, so that it can be repeated. They also accept alternation so you can write a%(b|c)*%d, which just seems natural.

Let’s take an example:

$ cat test.xml
<B/><a/><b/><a/><b/><a/><b/><a/><b/><a/><b/><E/>
$ xmlgrep 'B%(a|b)*%E' test.xml
<E/>

Groups are implemented without backtracking so using a group is not much more costly than using plain forward patterns, and is faster than using look-ahead subpatterns.

Yet, that is only true of normal groups. When you use a selection operator (+, #, - or =) on a group, it becomes a capturing group, a.k.a. look-ahead group. A look-ahead group is really a look-ahead subpattern combined with a group and has a somewhat subtle behavior. It behaves like a group, with the following difference: any selection operator inside a look-ahead group will only apply if the whole group matches. This can be used to bind selectors together. Thus we can revisit the classical key/value pair example: how do we match a key/pair with some properties on the key and some other on the value? You can always do that with simple look-ahead subpatterns combined with forward patterns, but groups just make it more convenient.

Imagine you want to match all key/value pairs where the key contains the letter a and the value is a string. Easy, with a capturing group:

$ xmlgrep -x '(key[.~"a"]#%string#)+' test.plist

Other components, reusability

Some people have discussed with me the possibility of making the library reusable. As it is now, it is not yet ready for wider exposure, but a lot of work has gone into rationalizing its API, in order to prepare for that. The API needs to be stabilized and the various components need to be better segregated.

Currently, the data structures in use are tightly integrated. For instance, the parser actually builds a tree with nodes able to hold matching states. Somehow, I would need to break that dependency, so that the matcher depends on the parser, but not the opposite, so you could use the XML parser as a stand-alone component (it has a nice hybrid design). Other parts are less affected, though.

What’s left to do

Now for the bitter part. Obviously, there are still a lot of things left to do, and I intend to continue working on this project, but whether it will be integrated into NetBSD at some point in the future is not for me to decide, though I really hope it will.

In short, here is the plan:

  1. write support for xmlsed templates in the library;
  2. write xmlsed proper;
  3. change the/add a high-level API to the library to make it possible to write xmljoin (because the high level framework only accepts one input file at this point in time);
  4. see what more needs to be adjusted for xmljoin to fit into the framework;
  5. write xmljoin proper;
  6. see what needs to be done for xmlsort to fit into the framework;
  7. write xmlsort.

Of course, I have given some thoughts to the "see what needs to be done" phases, but now is not right time to talk about that.

Also, at some point, I’ll need to address several shortcomings of the current code base:

  • add locale support;
  • convert my bunch of "regression shell scripts" to proper test suites (ATF, probably);
  • add non-NetBSD build support (currently, I have to hand-compile it on my Linux box).

And I will also want to:

  • clean-up the API and release the core components as a stand-alone library;
  • write documentation on the internals to accompany that library;
  • and of course, integrate it into NetBSD, if I am given permission. :)

Lastly, about xmlsed, though you can see no xmlsed directory in the repository, I can say it is pretty close to completion since the pieces needed are almost all there in the library (the only one missing is the template support, and that’s not the hard part).

Conclusion

The Google Summer of Code is nearing its end, but my summer vacations are not over yet. I won’t have classes until September, so I plan to carry on as if nothing much happened at least till the end of August. After that, I won’t have nearly as much free time, but I’ll do what I can; let’s not be hasty and wait till then to see how far the project has gone.

I will continue to update this blog and the code will be available at my Git repository, as usual. (Anyway, I have just recently added comments to my blog so feel free to drop by and give me your opinion.)

[ tag:blog.huoc.org,2009:posts/21 ]
No comments · Comment

xmltools implementation: automata and backtracking

#. By Nhat Minh Lê. Posted on 07/20 2009 at 12:39 AM. No comments.

At the moment, I’m implementing submatches in patterns, in preparation for xmlsed. Since most of what I do draws from formal languages and automata (except I’m too lazy to write anything formal), I’ve been looking into TNFAs for inspiration. Seeing how TRE seems to implement look-ahead without backtracking, I began to ask myself if such a thing could be done for my pattern matcher. Unfortunately, I can’t seem to figure out a solution by myself, so I’m blogging instead. :)

Basically, the problem is with constructs such as a[b]/c which match c, child of a only if a has a child b. However, when we reach a, we don’t know yet whether it has a child b or not, so we’re in need of some look-ahead.

This is akin to a(?=b)c in Perl regexes, but as you can already see, it’s fairly different. Perhaps the most obvious difference is that a tree has "two dimensions" while a string has only one, so it’s possible for a to have two children b and c, in a tree, while it’s not possible for a to be followed by two different characters, in a string.

Backtracking

The easiest way to solve this problem is to use backtracking. This is how the matcher is currently implemented. Even though about everything else is based solely on states and transitions, the part implementing look-ahead predicates uses backtracking.

The algorithm looks like this: everytime we have to match a look-ahead predicate, we stop, read as many nodes as we need to decide and then either drop the state or keep it, depending on the outcome.

This is a simple approach and it works well (i.e. it gives the expected results). Only, it’s slow. As a comparison, using the patterns I’ve used before for my benchmarks, the difference between hw/.="Ab\"a*cus" and p[hw/.="Ab\"a*cus"]# is the second one is about 2.66 times slower. This can be hand-optimized by rewriting the second pattern as body/p[hw/.="Ab\"a*cus"]#, with knowledge of the input format, where p can only appear as a child of body. This cuts the slowness to a mere 1.62 times slower.

At the same time, it demonstrates a major source of inefficiency in my backtracking implementation: loose (non-anchored) subpatterns, which may potentially apply to a great number of nodes yet would fail immediately for most of them, result in a lot of unwanted short backtracking segments where the matcher tries the subpattern and then immediately backtracks after failing, thus still examining the node twice.

Product automata

In theory, a pattern such as a(?=b)c matches if a is followed by both b and c so another natural idea would be to use a product automaton which transitions between couples of states: if we end up in two final states, then we win.

Only, this works well because we want to know whether the string matches as a whole. However, with an XML pattern such as a[b]/c, we need to identify which node matched the c part. This is where things start to get complicated.

Now we do know how to extract a submatch from a string with tagged transitions. So, great, does this apply to our XML matcher? The answer is "no", as far as I know (I’d be glad if someone proved me wrong, though). Tagged transitions are good at extracting one submatch, the last one, but in our case, the problem is we want all occurrences of c, not just the last one.

Keeping this information globally associated with the state is not an option since there may well be multiple instances of the look-ahead predicate trying to match at the same time. But trying to differentiate between two states based on the matching position is not an option either, because that could easily lead to linear memory usage. Consider, for example, the following XML fragment:

<a/><a/><a/><a/><a/>...<b/>

Trying to match {a%%b} (or (a%%b) using the current syntax in the master branch; sorry, I reverted to the old syntax because () will be used for groups) against that would result in a match at every a element. Hence, keeping the position as a state property is not a solution, since we’d have to distinguish between as many a nodes as there are in the input data.

Optimizing the backtracking method

Although the implementation of the matcher has undergone many changes in order to improve performances recently, mostly so as to accommodate new features while keeping a reasonable running time, I believe there’s still room for much improvement, maybe in the form of algorithmic enhancements rather than implementation optimizations.

Some of my more realistic ideas are:

  • Read one element ahead to reject some subpatterns without entering a backtracking context.

  • Try to predict the outcome of a look-ahead predicate and compute the more likely transitions along with the predicate itself, saving on backtracking (except to prune the tree) in case we’re right (but incuring additional recomputations if we’re wrong).

Also, this is not directly related to the backtracking approach, but I’m thinking of putting more information in the state cache; at the moment, we’re only caching "raw" transitions which account for the old state set and the direction (i.e. successor or child); we could try to cache local information as well, which would move us one step closer to the way NFA transition functions get cached to get DFAs.

Conclusion: do we need performance that badly?

Well, that’s a good question. And the answer is "we probably don’t". This is kind of my hobby, so don’t get the wrong idea. The whole thing is not that slow. On my simple example, xmlgrep was only 0.5s slower than libxml-powered xmlstarlet, taking less than six seconds to look up a word definition in a 58M dictionary, on my 2x2GHz Core 2, while using 100 times less memory.

I doubt anybody is going to use xmlgrep to look up words in dictionaries anytime soon; such a specialized task would best be served using an indexed collection of some sort, although I should mention that there are XML formats which are more suited to being parsed using my tools than others, so converting to a more appropriate (e.g. annotated) XML format using xmlsed then searching with xmlgrep could well be a viable solution too (given the conversion is one-time or infrequent compared to searches). More on that once xmlsed is out!

[ tag:blog.huoc.org,2009:posts/16 ]
No comments · Comment

xmltools update: new I/O layer and further plans

#. By Nhat Minh Lê. Posted on 07/08 2009 at 03:45 PM. No comments.

It has been nearly two weeks since my last update on my plans for xmlsed. Since that time, David and I have come to an agreement on the (hypothetical) syntax and my hybrid model has been retained.

Rewriting the I/O module

Although I now have a fair idea of what xmlsed should look like and how to implement it, some elements introduced after the recent discussions with my mentor required changes to the existing code.

In particular, the new template syntax required an XML parser able to handle partial XML documents. Besides, I wanted some syntactic sugar on top of XML to make writing templates easier. Expat being a well-behaved XML parser is quite strict about the syntax and there is no easy way to work around that. Though I did spend some days trying, at first, I eventually gave up, as it proved hard to write, let alone maintain in the future.

At the same time, I began to realize the original I/O abstraction I designed was showing its limits. I initially wanted to be able to read and write multiple tree representation formats, but in the end, the need to fully support XML, with all its specifics (doctypes, PIs, etc.), has convinced me to drop the idea and focus on XML alone.

All these reasons led me to rewrite the I/O layer (almost) completely. This work is being committed to yet another temporary branch: xmlpush.

Maybe the most visible change is that multi-backend support was dropped and the I/O module now consists of only two drivers: the Expat-based parser (the strict parser) and a home-made loose parser.

All tools now support two modes: the strict mode (-s flag) and the default loose mode.

  • In the strict mode, compliance with the XML standard is a priority: all extensions are disabled, including multi-root and partial documents (which was implemented with Expat as an ugly kludge before, so it was removed), the XML prolog is honored (the specified encoding is used and entity declarations are parsed), and stricter rules are enforced (e.g. in names).

  • In the loose mode, extensions are supported and the XML prolog is ignored (instead, we use the system locale).

The encoding issue

Although I’ve just stated how encodings will be handled, at the moment, I have not written any code to suppor that yet. Actually, it’s not a simple matter.

First of all, the XML standard mandates support of Unicode. For one thing, XML character entities (&xxxx;) are references to character codes from the Unicode tables. This makes support for Unicode pretty much mandatory.

In order to handle this, the Expat people have chosen to serve all data as UTF-8 (or UTF-16, depending on the build-time configuration), no matter what input encoding is used. This should have given you a hint: Expat has its own encoding conversion engine.

Now, even though XML documents can specify an encoding, and so it would be alright to always use UTF-8, that is not an acceptable solution in the real world: Unix users expect their programs to comply to the current locale. But Expat does not care one way or the other about, or integrate with, the locale system.

Fortunately, NetBSD has iconv(3) (though neither FreeBSD nor OpenBSD does, apparently, which will be a problem in making my programs portable; there seems to be an ongoing GSoC effort to port NetBSD libiconv to FreeBSD though), so I plan to run everything through that on input and again on output. Sounds like a waste of resources? Well, don’t blame me.

However, that’s not all: there is also the loose parser. Since this one was written by me, and I had no reason to force UTF-8 everywhere, it does no conversion and assumes all sources are in the current locale. The only problem will be for character entities (not currently implemented), but I intend to localize the Unicode-to-locale transformation to these, since it is costly.

Remember the push-style vs event-driven parser debate?

Well, it wasn’t really a debate, but maybe some of you remember that I posted some weeks ago a ticket on this blog about how I grew dissatisfied with the rigid event-driven behavior of Expat and wanted a push-style parser.

I took the occasion this time to make my wish come true (almost) and the new I/O design is mostly push-style. I say mostly because I’ve made some compromises in order not to impair performances.

At first, I thought about having the parser run on a fragment of input text and build a queue of events to be fed to the application one by one, on a on-demand basis. But then I realized I could just use the internal tree directly as my event queue, and have the application read that. Since we have support for look-ahead in the matching engine (and hence need to keep a partial internal tree in any case) and most tools will use that, this effectively moved the tree building code from the matcher to the I/O module, at the same time eliminating direct event responses (a bit less than 500 lines of code). This last point needs some clarification: sure we do use a little more memory (but honestly that just doesn’t show) since we build the tree first and only then process it, but this is limited to how many nodes can be represented in one read buffer, which means it’s mostly insignificant. But we have gained what I think is far more important: an unified implementation of the matcher (which is the most sensitive part).

Further plans

At the moment, the new parser does not support things such as entity declarations and I am still pondering whether to write code for that or not. In any case, it would be a good idea to have a xmlcat(1) utility which fully supports XML, including external entities, with the ability to fetch and include external documents (fetch(3)?) and substitute references. But let’s leave this discussion for another time.

As for the locale support, I think it will come after xmlsed is somewhat ready (i.e. as xmlgrep is right now). So for now, some more testing needs to be done on the new I/O components, and when this is done, I will resume work on xmlsed proper.

[ tag:blog.huoc.org,2009:posts/15 ]
No comments · Comment

xmlsed preview: writing XML

#. By Nhat Minh Lê. Posted on 06/25 2009 at 12:48 PM. No comments.

One might be tempted to think that, while parsing can be a complex task involving various competing models, printing is straightforward and should be the easy part of the job. This is not quite true however, as David and I have been in disagreement for some time already over the exact semantics and syntax of a potential XML printing subsystem of xmltools.

This post is all about the design of the printing language in xmltools and the underlying model. I also present at the end my latest hybrid proposal, that I’ve submitted to David for approval earlier today, and which I hope will bring satisfaction to all parties. If you have some opinion on this topic, do not hesitate to e-mail me! (My e-mail is my first name followed by my last name with all spaces replaced by dots; it’s written in French in the footer so you might not be able to decode it.)

The basic problem can be described as follows: we have a source tree, the input document and we want to transform it by adding or removing nodes in some places. Simple? Not quite.

Select & transform

The first design I proposed is based on sed(1): we select some node in the tree using a tree pattern (same as in my xmlgrep) and then transform it using some actions: insert, append, replace, etc. Only these needed to be adapted to XML.

My idea was then to use PYX/ESIS-inspired actions, which also happen to model closely my internal API. These consisted of a one-character prefix indicating the thing to print, followed by contents (depending on the prefix). Prefixes were:

( opening tag
) closing tag
@ attribute
. text
! comment
? junk

This is a very simple approach but David objected that it was unnatural to write XML using some language which was not XML.

Advantages
  • Very simple and straightforward.
  • Completely symmetrical to the way patterns are matched.
Disadvantages
  • Code used to write XML does not look like the output.
  • Not intuitive to read as data.

Decorate & pull

Another approach, which is used most notably by XSLT, is to have a template, which itself is written as XML, in which some elements from the source document are pulled.

Let me say it upfront in case anybody is not clear on this: pulling is not a stream technique, so this model as is cannot be implemented easily (some subsets of XSLT as stream transformations have been researched, but since it is from the start not a stream-oriented technique, it’s complex and IMHO not worth it).

David first suggested using a template as a way of having printing-heavy script actually look like their output. He observed plain XML could be written directly while some special elements, e.g. xmlsed:addattr could be used to print things that cannot be represented directly in XML. Indeed, there is no XML syntax to write just an attribute without the surrounding tag, yet there are cases where we might want to add an attribute to an existing element.

I criticized this idea based on the fact that reserving a namespace was intrusive and that using XML to represent scripting elements was too verbose.

But the main point was that pull-style transformation was never an option.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
Disadvantages
  • Model is not adapted to streaming.
  • Too verbose.

Towards a hybrid model: XML syntax for select & transform

Given the stream-unfriendly nature of the template-based model, David came up with the idea to represent actions of the select & transform model as XML: XML opening and closing tags could be used to represent the corresponding printing actions (respectively ( and )) but for the rest, we were still stuck with the way-too-verbose XSLT-like <xmlsed:whatever> elements.

I was still very dissatisfied with the verbosity of having to write <xmlsed:whatever/> just to add an attribute or specify where to insert the current node (before, after, inside the plain XML part), but the concept itself, of using a template, was quite nice.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
  • Still reasonably simple to implement.
Disadvantages
  • Still too verbose.

A hybrid model

Today, after more brainstorming, I finally decided on the following model, and unless David is really against it (which I seriously doubt), I think it will make it into xmlsed one way or the other: the idea is the same as that of the previous attempt, use a template which actually describes actions inferred from the structure of XML itself.

Only, this time, we do not use special elements to indicate actions anymore, everything is inferred from the structure. The trick to do this is to have a virtual element represent the current node; it can have any name (it will be user-settable), let’s call it : by default. We can place it wherever we want in the template. And to add an attribute (resp. a child), we just have to add an attribute (resp. a child) to that virtual element. The syntax remains reasonably compact (even optimally compact given we’re using XML as the basis for our template) and the script now very closely models the output, even more than XSLT scripts!

The idea was inspired from xargs(1) and its -I option. Again, Unix saved the day… well, not quite, but still, I think it is a nice and practical compromise. :)

[ tag:blog.huoc.org,2009:posts/10 ]
No comments · Comment

Next page Page: 0 1