Ours & Hippy — le blogOurs & Hippyourshippy@huoc.orgtag:blog.huoc.org,2009:atom2009-07-08T15:45:28+02:00tag:blog.huoc.org,2009:posts/15
xmltools update: new I/O layer and further plans
2009-07-08T15:45:28+02:002009-07-08T15:45:28+02:00Nhat Minh Lê (rz0)
<p>It has been nearly two weeks since my last update on my plans for
xmlsed. Since that time, David and I have come to an agreement on the
(hypothetical) syntax and my hybrid model has been retained.
</p><h3>Rewriting the I/O module
</h3><p>Although I now have a fair idea of what xmlsed should look like and
how to implement it, some elements introduced after the recent
discussions with my mentor required changes to the existing code.
</p><p>In particular, the new template syntax required an XML parser able to
handle partial XML documents. Besides, I wanted some syntactic sugar
on top of XML to make writing templates easier. Expat being
a well-behaved XML parser is quite strict about the syntax and there
is no easy way to work around that. Though I did spend some days
trying, at first, I eventually gave up, as it proved hard to write,
let alone maintain in the future.
</p><p>At the same time, I began to realize the original I/O abstraction
I designed was showing its limits. I initially wanted to be able to
read and write multiple tree representation formats, but in the end,
the need to fully support XML, with all its specifics (doctypes, PIs,
etc.), has convinced me to drop the idea and focus on XML alone.
</p><p>All these reasons led me to rewrite the I/O layer (almost)
completely. This work is being committed to yet another temporary
branch: <code>xmlpush</code>.
</p><p>Maybe the most visible change is that multi-backend support was
dropped and the I/O module now consists of only two drivers: the
Expat-based parser (the <i>strict</i> parser) and a home-made <i>loose</i>
parser.
</p><p>All tools now support two modes: the strict mode (<code>-s</code> flag) and the
default loose mode.
</p><ul><li><p>In the <strong>strict mode</strong>, compliance with the XML standard is
a priority: all extensions are disabled, including multi-root and
partial documents (which was implemented with Expat as an ugly
kludge before, so it was removed), the XML prolog is honored (the
specified encoding is used and entity declarations are parsed), and
stricter rules are enforced (e.g. in names).
</p></li><li><p>In the <strong>loose mode</strong>, extensions are supported and the XML prolog
is ignored (instead, we use the system locale).
</p></li></ul><h3>The encoding issue
</h3><p>Although I’ve just stated how encodings will be handled, at the
moment, I have not written any code to suppor that yet. Actually, it’s
not a simple matter.
</p><p>First of all, the XML standard mandates support of Unicode. For one
thing, XML character entities (<code>&xxxx;</code>) are references to character
codes from the Unicode tables. This makes support for Unicode pretty
much mandatory.
</p><p>In order to handle this, the Expat people have chosen to serve all
data as UTF-8 (or UTF-16, depending on the build-time configuration),
no matter what input encoding is used. This should have given you
a hint: Expat has its own encoding conversion engine.
</p><p>Now, even though XML documents can specify an encoding, and so it
would be alright to always use UTF-8, that is not an acceptable
solution in the real world: Unix users expect their programs to comply
to the current locale. But Expat does not care one way or the other
about, or integrate with, the locale system.
</p><p>Fortunately, NetBSD has <b>iconv(3)</b> (though neither FreeBSD nor
OpenBSD does, apparently, which will be a problem in making my
programs portable; there seems to be an <a class="extern" href="http://kovesdan.org/doc/en_US.ISO8859-1/soc2009/soc2009.html">ongoing GSoC effort to port
NetBSD libiconv to FreeBSD</a> though), so I plan to run
everything through that on input and again on output. Sounds like
a waste of resources? Well, don’t blame me.
</p><p>However, that’s not all: there is also the loose parser. Since this
one was written by me, and I had no reason to force UTF-8 everywhere,
it does no conversion and assumes all sources are in the current
locale. The only problem will be for character entities (not currently
implemented), but I intend to localize the Unicode-to-locale
transformation to these, since it is costly.
</p><h3>Remember the push-style vs event-driven parser debate?
</h3><p>Well, it wasn’t really a debate, but maybe some of you remember that
I posted some weeks ago a ticket on this blog about how I grew
dissatisfied with the rigid event-driven behavior of Expat and wanted
a push-style parser.
</p><p>I took the occasion this time to make my wish come true (almost) and
the new I/O design is <em>mostly</em> push-style. I say <i>mostly</i> because I’ve
made some compromises in order not to impair performances.
</p><p>At first, I thought about having the parser run on a fragment of input
text and build a queue of events to be fed to the application one by
one, on a on-demand basis. But then I realized I could just use the
internal tree directly as my <i>event queue</i>, and have the application
read that. Since we have support for look-ahead in the matching engine
(and hence need to keep a <em>partial</em> internal tree in any case) and
most tools will use that, this effectively moved the tree building
code from the matcher to the I/O module, at the same time eliminating
direct event responses (a bit less than 500 lines of code). This last
point needs some clarification: sure we do use a little more memory
(but honestly that just doesn’t show) since we build the tree first
and only then process it, but this is limited to how many nodes can be
represented in one read buffer, which means it’s mostly
insignificant. But we have gained what I think is far more important:
an unified implementation of the matcher (which <em>is</em> the most
sensitive part).
</p><h3>Further plans
</h3><p>At the moment, the new parser does not support things such as entity
declarations and I am still pondering whether to write code for that
or not. In any case, it would be a good idea to have a <b>xmlcat(1)</b>
utility which fully supports XML, including external entities, with
the ability to fetch and include external documents (<b>fetch(3)</b>?)
and substitute references. But let’s leave this discussion for another
time.
</p><p>As for the locale support, I think it will come after xmlsed is
somewhat ready (i.e. as xmlgrep is right now). So for now, some more
testing needs to be done on the new I/O components, and when this is
done, I will resume work on xmlsed proper.
</p>
tag:blog.huoc.org,2009:posts/8
xmlgrep by examples: playing with Atom
2009-06-17T14:23:11+02:002009-06-22T14:30:22+02:00Nhat Minh Lê (rz0)
<p><i>Edited 22.06.</i> <strong>Incompatible syntax changes</strong> and more examples.
</p><p>Since Matthew Sporleder <a class="extern" href="http://mail-index.netbsd.org/tech-userlevel/2009/06/16/msg002258.html">on tech-userlevel</a> has (implicitly)
suggested that xmlgrep could be used to dissect Atom feeds, but was
a bit lost at how to do it exactly, I thought I’d post a little demo
of a console session playing with an Atom file and xmlgrep (as well as
some other command-line tools).
</p><p>First, let’s fetch the feed:
</p><pre><code>$ ftp http://mspo.com/blog/atom.xml
[...]
</code></pre>As a side effect of xmlgrep, we might want to indent the XML file to
make it human-readable:
<pre><code>$ xmlgrep '*' atom.xml | more
</code></pre>List all posts in the NetBSD category with their IDs:
<pre><code>$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]
</code></pre><p>Those of you yet unfamiliar with the syntax might have some trouble
understanding. The previous pattern could be read "select a text child
of an id or title element, itself a child of an entry element, which
contains a category element which has an attribute child term equal to
NetBSD." Step by step, you should notice that <code>a[b]</code> is read "a such
that b", <code>|</code> stands for "or", <code>/</code> for "child of", <code>@</code> for "attribute",
<code>.</code> for "text", and the braces are used for grouping purposes.
</p><p>Now, let’s select a post by ID:
</p><pre><code>$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml
</code></pre>Or, select a post by title and view its contents using w3m:
<pre><code>$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html
</code></pre><p>As a side note, I should mention that up to now we have used
subpatterns quite a lot. This is because the Atom feed specification
does not force an order (or does it?) on the children of <code>entry</code>
elements. With more precise knowledge of the order of elements
relative to each other, we could have optimized the pattern to use <code>%</code>
and <code>%%</code> where possible. Subpatterns <em>are</em> costly, but for data sets
this size, we probably don’t care much.
</p><p>Let’s print all entry titles which date from March 2009 using the fact
that we <em>know</em> the <code>updated</code> element comes before the <code>title</code> one:
</p><pre><code>$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml
</code></pre>A friend of mine told me it would be useful to have arithmetic
predicates. I think they will feature in xmltools sooner or later, but
even without them, it is still possible to do some simple statistics,
by combining the results with <b>awk(1)</b>, for example. The following
one-liner counts the number of posts that have no older than March:
<pre><code>$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l
</code></pre><p>That’s it; I hope this will help people who want to get started with
xmlgrep. If you have other good examples you’d want me to elaborate,
do not hesitate to send me a mail!
</p>
tag:blog.huoc.org,2009:posts/5
XML as a tree representation
2009-06-10T15:33:14+02:002009-06-10T15:33:14+02:00Nhat Minh Lê (rz0)
<p>xmltools is a project whose ultimate goal is to bring Unix
command-line efficiency to the XML world (and not, as many might
think, to bring XML into the Unix world, which is being done already,
though not by my fault). The idea is to treat XML as a generic tree
representation format.
</p><p>This seems very straightforward; in fact, one could argue that XML was
designed from the start to be a generic tree representation. However,
XML has a number of problems that make it hard to use in this role.
</p><p>The first such problems is doctype declarations. While the bare XML
structure, which we see most of the time, is very simple, doctype
declarations and their semantics are both unneeded and unwanted in
tree <i>(syntax)</i> manipulation programs. It would be fine if doctype
support was optional for non-validating XML processors, but as
I understand it, the XML standard mandates support for a portion of
it: entity declarations and references, and default attribute values.
</p><p>The second issue is with white space handling: XML does not include
any default behavior regarding white space. XML parsers must pass all
white space to the application, which is responsible for <em>some
sensible default processing.</em> The <code>xml:space</code> attribute can give
a hint to the application but as far as I know, it’s not in wide use.
</p><p>This all led me to develop a set of <i>compatibility rules</i> which
precisely describe what one should expect from xmltools, no matter
which backend is used (in case we move away from expat, which I plan
to do, eventually). This subset is sufficient to describe any
tree-like data set.
</p><p>The rules will be maintained in the <code>doc/compatxml.text</code> file, in the
xmltools source tree. For convenience, I’ve reproduced the current
version below:
</p><ol><li><p>Do not use doctype declarations for any purpose other than
validation. In particular, do <em>not</em> rely on doctype declarations to
provide default attributes and do <em>not</em> use internal entities for
arbitrary substitutions; if you wanted to save typing, you wouldn’t
be using XML. Use of numeric and predefined entities is OK. Use of
directly-encoded characters is preferred.
</p></li><li><p>Do not presuppose the XML processor has access to any resource
beside your file: do not use external entities.
</p></li><li><p>Do not assume the XML processor is able to handle multiple
character sets: all the documents and other data supplied to the
program should use the same character set. Also, set your locale
appropriately.
</p></li><li><p>Unless <code>xml:space</code> is set, assume all white-space only text
elements are ignored by the XML processor.
</p></li><li><p>Set <code>xml:space</code> to <code>preserve</code> whenever white space not kept by the
previous rule must be preserved. As stated in the first rule, do not
rely on doctype declarations to implicitly set this
attribute. Please note however that the fourth rule works well for
most purposes, including processing of HTML <code>pre</code> elements which
only contain verbatim text.
</p></li></ol>
tag:blog.huoc.org,2009:posts/3
In search of a good pull-style XML parser
2009-06-09T11:07:26+02:002009-06-09T11:07:26+02:00Nhat Minh Lê (rz0)
<p>Recently, I’ve been working with expat extensively, as part of <a class="extern" href="http://netbsd-soc.sourceforge.net/projects/xmltools/">my
Google Summer of Code project</a>, and have come to the realization
that there is, IMHO, no good well-maintained lightweight
stream-oriented XML parser.
</p><p>Of course, one would think <i>expat</i>, at this point, and I thought so
too, when I started, only to find myself now locked with an expat-like
design, which is both inefficient and unnecessarily complicated (for
what I want to do).
</p><p>So, what is this expat design I am talking about? It is the
event-driven (callback-oriented) model, what the XML guys out there
call <i>push</i> parsers, though I personally think it doesn’t make much
sense to call it that way, so I won’t.
</p><p>While event-driven parsing is good for some tasks — awk(1) is
a really clever tool — it really fails on complex parsing tasks such
as those required by my xmltools that rely on <em>partial reprocessing of
the document tree.</em>
</p><p>Reprocessing of the document tree basically means we have to keep an
in-memory copy of the parts we want to traverse more than once. Of
course, expat can (help us) do that. Using expat, the basic idea is to
have two input sources which both call the event handlers. Then,
whenever we need to reprocess some part of the tree, we just traverse
our in-memory copy and feed the information to the handlers as if they
were receiving these from expat.
</p><p>Now, this seems good at first, but the fact is we now have <em>two</em>
parsers: expat and our tree parser. Moreover, they don’t share an
identical behavior: on the expat side, we may need to build new nodes
to put in the partial tree (for future reprocessing), while in the
in-memory tree parser, we may need to free some nodes which are no
longer needed. But even in the expat handling code there may be nodes
we have to keep (usually the current branch, from the root to the
current node); this in turn implies we sometimes have to free some
nodes as well. Now we have bits and pieces of the in-memory tree
management code everywhere. Things start to look messy…
</p><p>Besides, when do we call the surrogate parser? The issue here is we
don’t want expat to feed new data while we are in the middle of (or
more precisely before we can even start) a reprocessing operation. The
expat manual points at the <code>XML_StopParser</code> and <code>XML_ResumeParser</code>
functions. These seem to be exactly what we need, right?
Wrong. Indeed, a more careful reading of the documentation reveals
parsing does not stop immediately after the handler that called
<code>XML_StopParser</code>; there may be an <em>unspecified number of additional
events</em> between the end of that one and the moment the parser actually
stops. This makes these routines pretty much useless. Actually, we
need to use the secondary input system right on top of expat itself,
in the middle of our callbacks, whenever we need this
functionality. This is not pretty to say the least…
</p><p>But there is a more elegant solution: the idea is to have the internal
tree management module drive the process. It maintains the in-memory
tree and calls the event code on its nodes whenever necessary,
possibly going through loops or any patterns that suit its needs, for
that matter. Then, when it requires a node that is not present in the
tree, it <em>calls upon</em> the input module which returns the next node in
the stream. This is the so-called <i>pull method</i>. There really is
nothing special about it; in fact, it’s the way most I/O libraries out
there are built. Why XML had to be different, I don’t know.
</p><p>As I’ve said, to the extent of my knowledge, there is no lightweight
and well-maintained (I mean not some two-year-old experiment by some
obscure hacker, though I respect these things, I don’t have time right
now to play with such codes) C library which implements the full XML
standard and this pull metaphor. If anyone out there knows, do not
hesitate to let me know.
</p>