Ours & Hippy — le blogOurs & Hippyourshippy@huoc.orgtag:blog.huoc.org,2009:atom2010-04-06T19:29:07+02:00tag:blog.huoc.org,2009:posts/48
Not dead yet: xmltools, long after the summer
2010-04-06T19:29:07+02:002010-04-06T19:29:07+02:00Nhat Minh Lê (rz0)
<p>Soon, it will be the Summer (of Code) again, and this year, too,
NetBSD will be participating.<sup>α</sup> At around the same time, last year,
I was busy writing technical proposals for what would become xmltools,
one of the accepted NetBSD SoC projects of 2009. Well, most of you
have probably forgotten about that small project; some may even think
it was a useless attempt and a waste of time and money, which could
have been better spent elsewhere… Err, in fact, I’m writing this to
say that my xmltools are not dead! And I still intend to contribute
the code to the NetBSD tree sometime in the (near) future.
</p><div class="Notes"><p>α : And if you’re a student who’s into systems and stuff, looking
for a cool organization to join, with a broad selection of projects,
why not <a class="extern" href="http://www.netbsd.org/contrib/soc-projects.html">give NetBSD
a try</a>? :)
</p></div><p>That being said, if you take a look at the Git repo, you’ll surely
notice that nothing has moved for months… If you fool around a bit
more, though, you may notice another project on the same Git host:
<a class="extern" href="http://git.huoc.org/?p=regxml.git;a=summary">regxml</a>. Yep, this is
a rewrite of my xmltools!
</p><p>Why a rewrite, you ask? Well, basically because the old implementation
was flawed by design, in addition to having become a mess.
</p><h3>The little boring story
</h3><p>And here’s the long answer, for those who care. Feel free to skip this
section if it bores you.
</p><p>During the second part of last year’s SoC, as I was trying to work on
xmlsed, I realized my design was flawed. The main problem was that
I had based all my efforts around a concept that was, for our purpose,
completely inappropriate: node sets.
</p><p>So what’s a node set? A node set is simply that: a set of nodes. Now,
that’s great if you are extracting data from an XML document, as did
<b>xmlgrep</b>, but it sucks if you’re going to operate further on
that. In particular, removing arbitrary nodes from an XML tree doesn’t
yield a tree, and replacing a node set made no sense at all. (How
would you replace two far away nodes with one big chunk of XML? There
are many ways to go about it… but they’re all <em>wrong</em> IMO.)
</p><p>So what emerged at the end of the summer was that… I needed a new
design, and as my original algorithm had become crippled with dubious
ad hoc attempts to support operations that it never could, I needed
a new algorithm too. (Besides, there was a deficiency in my algorithm
that made it perform in super-polynomial time with some patterns and
data sets, which was bad.)
</p><p>It took me something like one month (since I wasn’t full-time on it)
to jot down the new concepts and a (hopefully) good algorithm on
paper. Then, for months, I’d try to implement it, but each time,
something would feel really wrong, and I’d trash my new
implementation. Well, that was till some days ago, when somehow, I got
it right, and so the new xmltools were born!
</p><h3>XML intervals
</h3><p>So all this boils down to the following fundamental change: XML node
sets have been replaced by XML intervals, as the core objects the
xmltools operate on. An XML interval is really just that: a bunch of
adjacent siblings and all their children. You should read the included
<b>xmltools(7)</b> manual page for more information, from a user’s point
of view.
</p><p>From a more theoretical viewpoint, intervals have two important
advantages over sets:
</p><ol><li><p>Inserting or removing an XML interval in an XML tree keeps it
a tree. In other words, XML trees are stable by insertion and
deletion of arbitrary XML intervals. (It may not have any meaning
anymore, but remember xmltools are about <em>syntax</em>, not semantics.)
</p></li><li><p>XML intervals map (injectively) to byte intervals in the XML file,
hence all editing operations on XML intervals (sequences of
insertions and deletions) can be expressed as operations on
sequences of bytes in the file itself.
</p></li></ol><p>The first property is what I am convinced will make <b>xmlsed</b>
possible. Previously, with node sets, it was a nightmare to give
proper semantics to the various operations. But now, we have simple
objects that still retain a great deal of expressivity but can be
manipulated easily, with simple rules.
</p><p>Although I do not use the second property yet, I’m definitely thinking
about introducing an API for that sometime in the (not quite near)
future.
</p><p>The idea is that you run the XML matcher on a file mapped in memory
(mapping new chunks as needed), and get a program script consisting of
byte operations, that you can then run very efficiently on the
memory-mapped file! Instead of printing an interval, you’d just copy
over the entire memory region. Instead of deleting an interval, you’d
just ignore that chunk. You get the idea. It could be used for
efficient processing of massive data with complex transformation
rules, but we’re not quite there, yet. :)
</p><h3>xmltools and NetBSD
</h3><p>As I’ve discussed with David (<dyoung>), I intend to import my core
library, as well as xmlgrep into base sometime in the near future
(once it has undergone more thorough testing). It depends on Expat for
XML parsing, so I’ll need to import that too.
</p><p>After much thinking, I think I’ll settle with a subdirectory in
<code>src/external/bsd</code>, though I use the NetBSD build system, as I want to
keep my development model using Git, and the ability to easily make
packages of all my tools for non-NetBSD platforms.
</p><p>The new code base uses (Net)BSD idioms more extensively than the old
one (because I’ve got the time to read more NetBSD code since
I started in May of last year). (E.g. I’ve followed David’s suggestion
that I should make use of <code>sys/queue.h</code>.) Quite surprisingly, it
compiles almost as is on GNU/Linux, with three additional defines and
two trivial functions. I do provide a simple GNU makefile in addition
to the default NetBSD build system makefiles, for GNU/Linux builds. Of
course, you’re welcome to try to build it on other systems I don’t
have access to!
</p><h3>What the future holds
</h3><p>Well, what’ll come next depends on a lot of factors, some of which are
completely outside of my control. But there is one thing <em>you</em> can do:
please download, compile, read the man pages, and test xmlgrep and
give me some feedback, so I can fix bugs!
</p><blockquote><div><pre><code>$ git clone git://git.huoc.org/regxml.git
</code></pre><p>Or through Git-Web:
<a class="extern" href="http://git.huoc.org/?p=regxml.git;a=summary">http://git.huoc.org/?p=regxml.git;a=summary</a>
</p></div></blockquote><p>As far as the xmltools project is concerned, I’m now going to
concentrate on xmlsed. I already have an idea of the implementation,
and I think it’s not going to be too hard… hopefully, I’m not
wrong. :)
</p><p>P.S.: Oh, I forgot to talk about the new algorithm. Well, to get the
idea, it’s basically an automaton backed by a backtracking memoizing
work-queue-based interpreter, which roams the XML tree and try to
match intervals. You should read the <b>xmltools(7)</b> man page for
a high-level overview of the technique. Maybe I’ll write something
more consistent on the subject later…
</p>
tag:blog.huoc.org,2009:posts/33
xmlsed prototype
2009-10-12T22:30:46+02:002009-10-12T22:30:46+02:00Nhat Minh Lê (rz0)
<p>Following David Young’s post on the official NetBSD blog, here’s my
own small announcement: xmlsed is here, or at least a somewhat working
first version. It’s in the Git repository, under the master branch
(everything’s been merged back into master some time ago).
</p><div class="Avertissement"><p>It’s mostly untested, and the exact interface/behavior is likely to
change, but if you’re interested in helping out, please read the man
pages included in the distribution (<b>xmlsed(1)</b> and
<b>xml_pattern(7)</b> should get you started) and play a bit with some
XML data (you can find sample files in <code>tests/data/</code>).
</p></div><p>So, what does my xmlsed do? Basically, it’s somewhat like <b>sed(1)</b>
except it does only one of the many things sed can do (though it’s
probably the most popular feature of sed): it replaces things with
other things, where, in our case, a "thing" is one or more nodes.
</p><p>It is based on the new event-handling mini internal framework and uses
the same patterns as xmlgrep (and the same matching code), augmented
with the register-binding operator <code>>=</code> which lets you put captured
nodes into a named variable, called a register. Then you can put nodes
of the form <code><$REGNAME/></code> in your replacement templates, and <i>voilà</i>!
The register’s contents gets inserted in place of the
<code><$REGNAME/></code>. Something like <code>xmlsed '((a#)>=a%(b#)>=b) <$b/><$a/>'</code>
should swap around two consecutive nodes <code>a</code> and <code>b</code>… well, I very
much hope it does.
</p><p>Well, that’s for the good part. Now, there are many things left
unresolved in xmlsed, mainly the fact that its behavior is not
consistent with xmlgrep. In xmlsed, if you replace a node, it also
deletes all of its descendants, whereas when you select a node in
xmlgrep (or in a capture in xmlsed), it comes alone, without any of
its descendants (except attributes). In fact, you can even do a lot
more weird stuffs in selections like selecting and grouping together
non directly connected parts of the tree. To sum up: selection is way
more powerful than replacement, and by default, they don’t really look
alike (and they aren’t quite implemented the same way either…).
</p><p>I’m thinking about rationalizing all this, along with the
syntax. Through the many feature additions (as David and I decided we
needed more power), selection patterns have probably become too
convoluted, unnecessarily complex and fine-grained where it doesn’t
matter. However, we should not give up on all that flexibility and
probably bring some over to the replacement mechanism. It’ll also be
a good opportunity to revise the syntax (more XPath-isms?) and migrate
to some automated parser generator, one that can create reentrant and
push-style parsers. I have that <a class="extern" href="http://git.huoc.org/?p=yacc.git;a=summary">big patch of NetBSD yacc</a>
doing just that, waiting in my Git repository, but it probably needs
more polishing before I can hope to submit it to the mailing lists to
get myself flamed for doing things on my own, but oh well, we’ll see.
</p><p>Now’s not the time for that yet. First, I need to test xmlsed more
thoroughly to ensure that the implementation underlying the syntax
really works.
</p><p>Also, other things that I’ll want/need to do once I’m done with xmlsed
(to the point where it runs well enough, doing something useful) are,
in no particular order:
</p><ul><li><p>Rewrite the matching code to be more generic, simple, and readable:
the idea is to split various constructs into more elementary blocks
(I haven’t decided exactly what they would be, but I’m thinking
about it) and have some kind of automaton feed on these
instructions and move accordingly.
</p></li><li><p>Improve the data structures in use, both for efficiency reasons and
to make it possible to (theorically) run multiple instances of the
matcher on the same stream of data, hopefully opening the way for
some more concurrency. This last point is almost possible now, but
not quite yet; there are still some matching data that remain in the
node itself while they shouldn’t.
</p></li><li><p>Implement the amended algorithm I’ve devised that should (if I’m
correct…) eliminate the "bad" parts of the complexity that make it
theorically inferior to the transducer network algorithm, while
taking somewhat more memory (though the asymptotic bounds on memory
consumption should stay the same).
</p></li><li><p>Do something about locales; also handle integration into base, and
packaging for non-NetBSD systems. In short: better integration.
</p></li></ul><p>Then of course, there’ll also be work on the remaining tools: some
kind of xmljoin is probably next after xmlsed. In which order
I address the various points, I don’t know yet; it’ll probably depend
on what I feel is more urgent, and what I feel like doing first. :)
</p>
tag:blog.huoc.org,2009:posts/10
xmlsed preview: writing XML
2009-06-25T12:48:34+02:002009-06-25T12:48:34+02:00Nhat Minh Lê (rz0)
<p>One might be tempted to think that, while parsing can be a complex
task involving various competing models, printing is straightforward
and should be the easy part of the job. This is not quite true
however, as David and I have been in disagreement for some time
already over the exact semantics and syntax of a potential XML
printing subsystem of xmltools.
</p><p>This post is all about the design of the printing language in xmltools
and the underlying model. I also present at the end my latest hybrid
proposal, that I’ve submitted to David for approval earlier today, and
which I hope will bring satisfaction to all parties. If you have some
opinion on this topic, do not hesitate to e-mail me! (My e-mail is my
first name followed by my last name with all spaces replaced by dots;
it’s written in French in the footer so you might not be able to
decode it.)
</p><p>The basic problem can be described as follows: we have a source tree,
the input document and we want to transform it by adding or removing
nodes in some places. Simple? Not quite.
</p><h3>Select & transform
</h3><p>The first design I proposed is based on <b>sed(1)</b>: we select some
node in the tree using a tree pattern (same as in my xmlgrep) and then
transform it using some actions: insert, append, replace, etc. Only
these needed to be adapted to XML.
</p><p>My idea was then to use PYX/ESIS-inspired actions, which also happen
to model closely my internal API. These consisted of a one-character
prefix indicating the thing to print, followed by contents (depending
on the prefix). Prefixes were:
</p><table><tr><td><code>(</code> </td><td>opening tag
</td></tr><tr><td><code>)</code> </td><td>closing tag
</td></tr><tr><td><code>@</code> </td><td>attribute
</td></tr><tr><td><code>.</code> </td><td>text
</td></tr><tr><td><code>!</code> </td><td>comment
</td></tr><tr><td><code>?</code> </td><td>junk
</td></tr></table><p>This is a very simple approach but David objected that it was
unnatural to write XML using some language which was not XML.
</p><dl><dt><b>Advantages</b></dt><dd><ul><li>Very simple and straightforward.
</li><li>Completely symmetrical to the way patterns are matched.
</li></ul></dd><dt><b>Disadvantages</b></dt><dd><ul><li>Code used to write XML does not look like the output.
</li><li>Not intuitive to read as data.
</li></ul></dd></dl><h3>Decorate & pull
</h3><p>Another approach, which is used most notably by XSLT, is to have
a template, which itself is written as XML, in which some elements
from the source document are <em>pulled</em>.
</p><p>Let me say it upfront in case anybody is not clear on this: pulling is
<em>not</em> a stream technique, so this model as is cannot be implemented
easily (some subsets of XSLT as stream transformations have been
researched, but since it is from the start not a stream-oriented
technique, it’s complex and IMHO not worth it).
</p><p>David first suggested using a template as a way of having
printing-heavy script actually look like their output. He observed
plain XML could be written directly while some special elements,
e.g. <code>xmlsed:addattr</code> could be used to print things that cannot be
represented directly in XML. Indeed, there is no XML syntax to write
just an attribute without the surrounding tag, yet there are cases
where we might want to add an attribute to an existing element.
</p><p>I criticized this idea based on the fact that reserving a namespace
was intrusive and that using XML to represent scripting elements was
too verbose.
</p><p>But the main point was that pull-style transformation was never an
option.
</p><dl><dt><b>Advantages</b></dt><dd><ul><li>XML in script remains in output.
</li><li>Very intuitive and easy to read.
</li></ul></dd><dt><b>Disadvantages</b></dt><dd><ul><li>Model is not adapted to streaming.
</li><li>Too verbose.
</li></ul></dd></dl><h3>Towards a hybrid model: XML syntax for select & transform
</h3><p>Given the stream-unfriendly nature of the template-based model, David
came up with the idea to represent actions of the select & transform
model as XML: XML opening and closing tags could be used to represent
the corresponding printing actions (respectively <code>(</code> and <code>)</code>) but for
the rest, we were still stuck with the way-too-verbose XSLT-like
<code><xmlsed:whatever></code> elements.
</p><p>I was still very dissatisfied with the verbosity of having to write
<code><xmlsed:whatever/></code> just to add an attribute or specify where to
insert the current node (before, after, inside the plain XML part),
but the concept itself, of using a template, was quite nice.
</p><dl><dt><b>Advantages</b></dt><dd><ul><li>XML in script remains in output.
</li><li>Very intuitive and easy to read.
</li><li>Still reasonably simple to implement.
</li></ul></dd><dt><b>Disadvantages</b></dt><dd><ul><li>Still too verbose.
</li></ul></dd></dl><h3>A hybrid model
</h3><p>Today, after more brainstorming, I finally decided on the following
model, and unless David is really against it (which I seriously
doubt), I think it will make it into xmlsed one way or the other: the
idea is the same as that of the previous attempt, use a template which
actually describes actions inferred from the structure of XML itself.
</p><p>Only, this time, we do not use special elements to indicate actions
anymore, <em>everything</em> is inferred from the structure. The trick to do
this is to have a virtual element represent the current node; it can
have any name (it will be user-settable), let’s call it <code>:</code> by
default. We can place it wherever we want in the template. And to add
an attribute (resp. a child), we just have to add an attribute
(resp. a child) to that virtual element. The syntax remains reasonably
compact (even <em>optimally compact</em> given we’re using XML as the basis
for our template) and the script now very closely models the output,
even more than XSLT scripts!
</p><p>The idea was inspired from <b>xargs(1)</b> and its <code>-I</code> option. Again,
Unix saved the day… well, not quite, but still, I think it is a nice
and practical compromise. :)
</p>