Ours & Hippy — le blog (ourshippy@huoc.org)
xmltools update: new I/O layer and further plans
Nhat Minh Lê (rz0), 2009-07-08T15:45:28+02:00
<p>It has been nearly two weeks since my last update on my plans for
xmlsed. Since that time, David and I have come to an agreement on the
(hypothetical) syntax and my hybrid model has been retained.
</p><h3>Rewriting the I/O module
</h3><p>Although I now have a fair idea of what xmlsed should look like and
how to implement it, some elements introduced after the recent
discussions with my mentor required changes to the existing code.
</p><p>In particular, the new template syntax required an XML parser able to
handle partial XML documents. Besides, I wanted some syntactic sugar
on top of XML to make writing templates easier. Expat, being
a well-behaved XML parser, is quite strict about syntax, and there
is no easy way to work around that. I did spend a few days
trying at first, but eventually gave up: the workaround proved hard
to write, let alone maintain.
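</p><p>To illustrate that strictness, here is a minimal sketch using Python's
standard <code>xml.parsers.expat</code> binding rather than the C API used by xmltools.
The helper name <code>expat_accepts</code> and the sample inputs are assumptions made
for this example only.
</p>

```python
import xml.parsers.expat as expat


def expat_accepts(text):
    """Return True if Expat parses `text` as one complete XML document.

    Hypothetical helper for illustration; not part of xmltools.
    """
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True: this is the final chunk of input
    except expat.ExpatError:
        return False
    return True


# One well-formed document, then a multi-root input, a partial document,
# and bare text: the kinds of input a template syntax would need.
for sample in ("<a><b/></a>", "<a/><b/>", "<b>partial", "just text"):
    print(repr(sample), expat_accepts(sample))
```

<p>Only the first sample is accepted. Multi-root input, unclosed elements
and bare text are all reported as errors, which is why these extensions
need something other than plain Expat.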
</p><p>At the same time, I began to realize the original I/O abstraction
I designed was showing its limits. I initially wanted to be able to
read and write multiple tree representation formats, but in the end,
the need to fully support XML, with all its specifics (doctypes, PIs,
etc.), has convinced me to drop the idea and focus on XML alone.
</p><p>All these reasons led me to rewrite the I/O layer (almost)
completely. This work is being committed to yet another temporary
branch: <code>xmlpush</code>.
</p><p>Maybe the most visible change is that multi-backend support was
dropped and the I/O module now consists of only two drivers: the
Expat-based parser (the <i>strict</i> parser) and a home-made <i>loose</i>
parser.
</p><p>All tools now support two modes: the strict mode (<code>-s</code> flag) and the
default loose mode.
</p><ul><li><p>In the <strong>strict mode</strong>, compliance with the XML standard is
a priority: all extensions are disabled, including multi-root and
partial documents (these were previously implemented on top of Expat
as an ugly kludge, which has now been removed), the XML prolog is honored (the
specified encoding is used and entity declarations are parsed), and
stricter rules are enforced (e.g. in names).
</p></li><li><p>In the <strong>loose mode</strong>, extensions are supported and the XML prolog
is ignored (instead, we use the system locale).
</p></li></ul><h3>The encoding issue
</h3><p>Although I’ve just stated how encodings will be handled, I have
not written any code to support that yet. It is not a simple
matter.
</p><p>First of all, the XML standard mandates support for Unicode. For one
thing, XML character references (<code>&#xxxx;</code>) refer to character
codes from the Unicode tables, so an XML processor cannot avoid
dealing with Unicode.
</p><p>In order to handle this, the Expat people have chosen to serve all
data as UTF-8 (or UTF-16, depending on the build-time configuration),
no matter what input encoding is used. This should have given you
a hint: Expat has its own encoding conversion engine.
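</p><p>A small sketch of that behavior, again through Python's
<code>xml.parsers.expat</code> binding. One caveat: the binding hands character data
back as Python <code>str</code> objects, whereas the C library delivers UTF-8 (or
UTF-16) bytes directly, so the sketch re-encodes the result to show what
a C caller would see. The function name <code>character_data</code> and the sample
document are assumptions.
</p>

```python
import xml.parsers.expat as expat


def character_data(document):
    """Parse `document` (bytes) with Expat and return its text content."""
    pieces = []
    parser = expat.ParserCreate()  # no override: use the encoding in the prolog
    parser.CharacterDataHandler = pieces.append  # may be called several times
    parser.Parse(document, True)
    return "".join(pieces)


# The input is ISO-8859-1: the e-acute is the single byte 0xE9.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'
text = character_data(latin1_doc)

# Re-encoded as UTF-8, the same character is the two bytes 0xC3 0xA9.
print(text.encode("utf-8"))
```

<p>The conversion happens inside Expat, based on the prolog alone. The
current locale plays no part in it.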
</p><p>Now, since XML documents can specify their own encoding, it would
in principle be all right to always use UTF-8. In the real world,
though, that is not an acceptable solution: Unix users expect their
programs to comply with the current locale, and Expat neither knows
about nor integrates with the locale system.
</p><p>Fortunately, NetBSD has <b>iconv(3)</b> (though neither FreeBSD nor
OpenBSD does, apparently, which will be a problem in making my
programs portable; there seems to be an <a class="extern" href="http://kovesdan.org/doc/en_US.ISO8859-1/soc2009/soc2009.html">ongoing GSoC effort to port
NetBSD libiconv to FreeBSD</a> though), so I plan to run
everything through that on input and again on output. Sounds like
a waste of resources? Well, don’t blame me.
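</p><p>The two conversion passes can be sketched as follows. This is only
an illustration: it uses Python's built-in codecs as a stand-in for
<b>iconv(3)</b>, the function names are hypothetical, and replacing
unrepresentable characters with <code>?</code> on output is an assumption, not
a decision taken for xmltools.
</p>

```python
import locale


def locale_to_utf8(data, locale_encoding=None):
    """Input pass: transcode locale-encoded bytes to UTF-8 for the parser."""
    enc = locale_encoding or locale.getpreferredencoding(False)
    return data.decode(enc).encode("utf-8")


def utf8_to_locale(data, locale_encoding=None):
    """Output pass: transcode the parser's UTF-8 back to the locale encoding."""
    enc = locale_encoding or locale.getpreferredencoding(False)
    # Assumption: characters missing from the locale's charset become "?".
    return data.decode("utf-8").encode(enc, errors="replace")


# Round trip with an explicit ISO-8859-1 "locale" so the result does not
# depend on the machine running the example.
as_utf8 = locale_to_utf8(b"caf\xe9", "iso-8859-1")
print(as_utf8, utf8_to_locale(as_utf8, "iso-8859-1"))
```

<p>Every byte of input is decoded and re-encoded twice, which is where
the waste mentioned above comes from.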
</p><p>However, that’s not all: there is also the loose parser. Since this
one was written by me, and I had no reason to force UTF-8 everywhere,
it does no conversion and assumes all sources are in the current
locale. The only problem is character references (not currently
implemented): I intend to confine the Unicode-to-locale
conversion to these, since it is costly.
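</p><p>One way this could look, as a hedged Python sketch rather than the
actual loose parser: the input stays in the locale encoding as raw bytes,
and only numeric character references (the character entities mentioned
above) go through a Unicode-to-locale conversion. The name
<code>expand_char_refs</code>, the regular expression, and the choice to leave
unrepresentable references untouched are all assumptions.
</p>

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
CHAR_REF = re.compile(rb"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_char_refs(data, locale_encoding):
    """Expand numeric character references in locale-encoded bytes.

    Everything else in `data` is passed through without any conversion.
    """
    def substitute(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code).encode(locale_encoding)
        except (ValueError, OverflowError):
            # Not a valid code point, or not representable in this locale:
            # keep the reference as written (an assumption of this sketch).
            return match.group(0)

    return CHAR_REF.sub(substitute, data)


print(expand_char_refs(b"caf&#xE9; \xe0 &#233;", "iso-8859-1"))
print(expand_char_refs(b"&#x20AC;", "iso-8859-1"))
```

<p>The second call leaves the reference as it is, because the euro sign
has no ISO-8859-1 representation. What the real parser should do in that
case is an open question.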
</p><h3>Remember the push-style vs event-driven parser debate?
</h3><p>Well, it wasn’t really a debate, but some of you may remember that
I posted an entry on this blog some weeks ago about how I had grown
dissatisfied with the rigid event-driven behavior of Expat and wanted
a push-style parser.
</p><p>I took this opportunity to make my wish come true (almost):
the new I/O design is <em>mostly</em> push-style. I say <i>mostly</i> because I’ve
made some compromises in order not to hurt performance.
</p><p>At first, I thought about having the parser run on a fragment of input
text and build a queue of events to be fed to the application one by
one, on demand. But then I realized I could just use the
internal tree directly as my <i>event queue</i>, and have the application
read that. Since we have support for look-ahead in the matching engine
(and hence need to keep a <em>partial</em> internal tree in any case) and
most tools will use that, this effectively moved the tree building
code from the matcher to the I/O module, at the same time eliminating
direct event responses (a bit less than 500 lines of code). This last
point needs some clarification. Sure, we do use a little more memory,
since we build the tree first and only then process it (honestly, it
just doesn’t show). But the overhead is limited to the number of nodes
that can be represented in one read buffer, which makes it mostly
insignificant. And we have gained what I think is far more important:
a unified implementation of the matcher (which <em>is</em> the most
sensitive part).
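</p><p>The general shape of this design (feed the parser one buffer, then let
the application read the nodes built so far) can be sketched with Python's
standard <code>xml.etree.ElementTree.XMLPullParser</code>. This is an analogy only,
not the xmltools I/O module, and the name <code>pull_events</code> is an assumption.
</p>

```python
from xml.etree.ElementTree import XMLPullParser


def pull_events(chunks):
    """Feed input one buffer at a time and yield (event, tag) on demand."""
    parser = XMLPullParser(events=("start", "end"))
    for chunk in chunks:
        parser.feed(chunk)  # the parser builds tree nodes from this buffer
        # The application then reads whatever nodes are ready so far.
        for event, element in parser.read_events():
            yield event, element.tag
    parser.close()
    # Drain anything the parser held back until the end of input.
    for event, element in parser.read_events():
        yield event, element.tag


for event, tag in pull_events(["<doc><item>", "text</item>", "</doc>"]):
    print(event, tag)
```

<p>Unlike the design described above, ElementTree keeps the whole tree in
memory. Keeping only a partial tree bounded by one read buffer is
specific to the new I/O module.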
</p><h3>Further plans
</h3><p>At the moment, the new parser does not support things such as entity
declarations and I am still pondering whether to write code for that
or not. In any case, it would be a good idea to have an <b>xmlcat(1)</b>
utility which fully supports XML, including external entities, with
the ability to fetch and include external documents (<b>fetch(3)</b>?)
and substitute references. But let’s leave this discussion for another
time.
</p><p>As for locale support, I think it will come after xmlsed is
somewhat ready (i.e. in the state xmlgrep is in right now). So for now, some more
testing needs to be done on the new I/O components, and when this is
done, I will resume work on xmlsed proper.
</p>