Ours & Hippy — le blogOurs & Hippyourshippy@huoc.orgtag:blog.huoc.org,2009:atom2009-07-20T00:39:10+02:00tag:blog.huoc.org,2009:posts/16
xmltools implementation: automata and backtracking
2009-07-20T00:39:10+02:002009-07-20T00:39:10+02:00Nhat Minh Lê (rz0)
<p>At the moment, I’m implementing submatches in patterns, in preparation
for xmlsed. Since most of what I do draws from formal languages and
automata (except I’m too lazy to write anything formal), I’ve been
looking into <a class="extern" href="http://www.laurikari.net/ville/regex-submatch.pdf">TNFAs</a> for inspiration. Seeing how <a class="extern" href="http://laurikari.net/tre/">TRE</a> seems to
implement look-ahead without backtracking, I began to ask myself if
such a thing could be done for my pattern matcher. Unfortunately,
I can’t seem to figure out a solution by myself, so I’m blogging
instead. :)
</p><p>Basically, the problem is with constructs such as <code>a[b]/c</code> which match
<code>c</code>, child of <code>a</code> only if <code>a</code> has a child <code>b</code>. However, when we reach
<code>a</code>, we don’t know yet whether it has a child <code>b</code> or not, so we’re in
need of some <em>look-ahead</em>.
</p><p>This is akin to <code>a(?=b)c</code> in Perl regexes, but as you can already see,
it’s fairly different. Perhaps the most obvious difference is that
a tree has "two dimensions" while a string has only one, so it’s
possible for <code>a</code> to have two children <code>b</code> and <code>c</code>, in a tree, while
it’s not possible for <code>a</code> to be followed by two different characters,
in a string.
</p><h3>Backtracking
</h3><p>The easiest way to solve this problem is to use backtracking. This is
how the matcher is currently implemented. Even though about everything
else is based solely on states and transitions, the part implementing
look-ahead predicates uses backtracking.
</p><p>The algorithm looks like this: everytime we have to match a look-ahead
predicate, we stop, read as many nodes as we need to decide and then
either drop the state or keep it, depending on the outcome.
</p><p>This is a simple approach and it works well (i.e. it gives the
expected results). Only, it’s slow. As a comparison, using the
patterns I’ve used before for my benchmarks, the difference between
<code>hw/.="Ab\"a*cus"</code> and <code>p[hw/.="Ab\"a*cus"]#</code> is the second one is
about 2.66 times slower. This can be hand-optimized by rewriting the
second pattern as <code>body/p[hw/.="Ab\"a*cus"]#</code>, with knowledge of the
input format, where <code>p</code> can only appear as a child of <code>body</code>. This
cuts the slowness to a mere 1.62 times slower.
</p><p>At the same time, it demonstrates a major source of inefficiency in my
backtracking implementation: loose (non-anchored) subpatterns, which
may potentially apply to a great number of nodes yet would fail
immediately for most of them, result in a lot of unwanted short
backtracking segments where the matcher tries the subpattern and then
immediately backtracks after failing, thus still examining the node
twice.
</p><h3>Product automata
</h3><p>In theory, a pattern such as <code>a(?=b)c</code> matches if <code>a</code> is followed by
<em>both</em> <code>b</code> and <code>c</code> so another natural idea would be to use a product
automaton which transitions between couples of states: if we end up in
two final states, then we win.
</p><p>Only, this works well because we want to know whether the string
matches as a whole. However, with an XML pattern such as <code>a[b]/c</code>, we
need to identify <em>which</em> node matched the <code>c</code> part. This is where
things start to get complicated.
</p><p>Now we do know how to extract a submatch from a string with tagged
transitions. So, great, does this apply to our XML matcher? The answer
is "no", as far as I know (I’d be glad if someone proved me wrong,
though). Tagged transitions are good at extracting <em>one</em> submatch, the
last one, but in our case, the problem is we want <em>all</em> occurrences of
<code>c</code>, not just the last one.
</p><p>Keeping this information globally associated with the state is not an
option since there may well be multiple instances of the look-ahead
predicate trying to match at the same time. But trying to
differentiate between two states based on the matching position is not
an option either, because that could easily lead to linear memory
usage. Consider, for example, the following XML fragment:
</p><pre><code><a/><a/><a/><a/><a/>...<b/>
</code></pre><p>Trying to match <code>{a%%b}</code> (or <code>(a%%b)</code> using the current syntax in the
master branch; sorry, I reverted to the old syntax because <code>()</code> will
be used for groups) against that would result in a match at every <code>a</code>
element. Hence, keeping the position as a state property is not
a solution, since we’d have to distinguish between as many <code>a</code> nodes
as there are in the input data.
</p><h3>Optimizing the backtracking method
</h3><p>Although the implementation of the matcher has undergone many changes
in order to improve performances recently, mostly so as to accommodate
new features while keeping a reasonable running time, I believe
there’s still room for much improvement, maybe in the form of
algorithmic enhancements rather than implementation optimizations.
</p><p>Some of my more realistic ideas are:
</p><ul><li><p>Read one element ahead to reject some subpatterns without entering
a backtracking context.
</p></li><li><p>Try to predict the outcome of a look-ahead predicate and compute the
more likely transitions along with the predicate itself, saving on
backtracking (except to prune the tree) in case we’re right (but
incuring additional recomputations if we’re wrong).
</p></li></ul><p>Also, this is not directly related to the backtracking approach, but
I’m thinking of putting more information in the state cache; at the
moment, we’re only caching "raw" transitions which account for the old
state set and the direction (i.e. successor or child); we could try to
cache local information as well, which would move us one step closer
to the way NFA transition functions get cached to get DFAs.
</p><h3>Conclusion: do we need performance that badly?
</h3><p>Well, that’s a good question. And the answer is "we probably
don’t". This is kind of my hobby, so don’t get the wrong idea. The
whole thing is not that slow. On my simple example, xmlgrep was only
0.5s slower than libxml-powered xmlstarlet, taking less than six
seconds to look up a word definition in a 58M dictionary, on my 2x2GHz
Core 2, while using 100 times less memory.
</p><p>I doubt anybody is going to use xmlgrep to look up words in
dictionaries anytime soon; such a specialized task would best be
served using an indexed collection of some sort, although I should
mention that there are XML formats which are more suited to being
parsed using my tools than others, so converting to a more appropriate
(e.g. annotated) XML format using xmlsed then searching with xmlgrep
could well be a viable solution too (given the conversion is one-time
or infrequent compared to searches). More on that once xmlsed is out!
</p>