[Atom] [Mail] [Twitter]
Links: git · hacks · misc · cabal · buzz! · about +
Today’s menu
\/

xmltools: after the GSoC

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 09/07 2009 at 05:14 PM. No comments.
<. xmltools: what's done, what's left
api atf xmltools

Since the end of the Summer of Code, I have been busy doing various miscellaneous tasks around my xmltools, so that it be ready for inclusion into the NetBSD tree some time in the future. This is a short summary of what has happened (just so people know that my project’s not dead, I’ve not vanished, or something along these lines); please take a look at the code repo for details:

ATF test suites

I’ve finally decided to convert my bunch of shell scripts to ATF tests. These are basic tests which check the various constructs of the pattern language. There are 46 tests, three of which are optional (they are made against rather big data sets that I do not distribute with the sources). These include a test suite centered around property list matching. If you have relevant test cases and would like to see them included, please send them over.

Library API

Some people had been asking me if there would be a library, well, yes, and I’ve libified all the existing code in anticipation of that, and also because I wanted to eventually commit somthing clean to the NetBSD tree. All routines have been made to support proper error reporting, and expose clean interfaces. The library is divided into four main sub-namespaces:α

  • hhxml_chunk_* : XML in-memory tree manipulations.
  • hhxml_stream_* : XML streams and (push-style) parsers. The streams have a hybrid design: they can be used to read nodes one at a time in a pure stream fashion or to generate partial in-memory trees. See misc/atomheadlines.c in the source distribution for an example.
  • hhxml_path_* : patterns and matching. This one is incomplete and the old match.c will need to be converted to fit into the new API.
  • hhxml_pprint_* : XML pretty printer. The implementation is also mostly incomplete at this time: it only supports printing to stdout, well, because the command-line tools don’t require any more than that.

The various parts of the API will be completed as I go about writing the tools themselves, but I wanted to have clean well-defined interfaces to build onto.

α: By the way, "hhxml" stands for "(the) very short XML (library)".

Non-visible changes to the pattern matching engine and plans

Well, I’ll write more on this one when I get the time, but I’ve been looking into other implementations and theories, especially SPEX, which boasts algorithms with good complexities (better than what is currently implemented in my tools). I have plans for improvements to the engine which I believe, if I’m not wrong, would bring us similar performance, but that will have to wait, since it is not critically important, at the moment, in my opinion.

That’s all for today!

[ tag:blog.huoc.org,2009:posts/28 ]
No comments · Comment

/\ \/

xmltools: what's done, what's left

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 08/13 2009 at 11:44 PM. No comments.
<. xmltools implementation: automata and backtracking
>. xmltools: after the GSoC
gsoc xmltools

So the Summer of Code ends soon and it’s time to report what has been done, and what is still lacking. Obviously, everybody can see there is no xmlsed yet in the repository, so what have I been doing since I completed xmlgrep one month ago? What went wrong?

Well, if you ask me, nothing really went wrong, I have just deviated somewhat from my original goals and taken the time to do some things that weren’t part of the initial plan. And even though I am disappointed not to have finished, this has brought its share of good additions. Most of these modifications stem from discussions with David (Young) about features we’d like to see in my xmltools. Also in all these three months, I’ve spent about three weeks optimizing the matcher code.

What we have

Let’s talk about the good stuff first. Of course, the visible part of my work is the existence of xmlgrep, but most of the code actually lives in a common static library. It provides:

  • definitions for patterns and trees;
  • utilities: (tree) buffers, bit vectors, symbol look-up, state and transition caching;
  • a string to pattern parser (that I often just call "pattern compiler");
  • an XML parser (which can handle fragments and multi-roots documents, or work as an interface to Expat, for strict XML parsing);
  • an XML pretty printer;
  • the code for the matcher itself;
  • and a higher-level API used to build the tools themselves.

The library has managed to remain reasonably small, with less than 5k lines of code, counting comments and everything, and the core matching algorithm fits in less than 1.5k lines or so (compare this to the 15k-line xpath.c from libxml2). It is stream-oriented and care has been taken so it eats as little memory as possible while remaining efficient.

The pattern matching code

The matcher is undeniably the most important and sensible piece of code in there. It is a blend of a kind-of-NFA and a backtracking machine, which processes one node at the time (NFA) but is able to do unlimited look-ahead (backtracking). Recent work has involved adding new advanced constructs to the matcher (besides forward and look-ahead mechanisms, which were already implemented) and improving its performances when look-ahead is in use.

Performances

On the performance side, the main addition is that of a state cache and pre-computed states created by the pattern compiler (instead of the matching code itself).

The state cache (implemented as a hash table) operates at two levels :

  • propagation (from parent to child and sibling to sibling);
  • and filtering (deciding whether based on local properties, states match or not).

One effect of using a cache is that bit vectors need not be pooled anymore and a lot of copying has disappeared in favor of pointer swapping. Also, relying on states pre-computed by the pattern compiler has shifted the focus from looping bit by bit through bit vectors to whole-vector operations, most of which can be auto-vectorized by a sufficiently smart compiler (ICC on Linux can vectorize a dozen of these in the whole xmlgrep code, but as I’ve pointed before, GCC auto-vectorizer seems to lag behind its non-use, at least on my code and setup).

With these optimizations merged in, xmlgrep now outperforms xmlstarlet both in terms of speed and memory consumption on my silly little benchmark (for those who have not been there, it consists in grepping for a word and its definition in a 50M XML-formatted English dictionary, and yes, it’s simple-minded), even when using suboptimal look-ahead subpatterns.

Indeed, knowledge of the XML format often enables us to write optimized patterns by anchoring a more costly part with a basic predicate. In our case, instead of searching for p[hw/.="somestring"] (the suboptimal case), we could search for body/p[hw/.="somestring"] since p can only occur inside body. This is an improvement since the p[] part, being a look-ahead subpattern, is sensibly slower than a forward pattern, so reducing the number of tentative matches is essential.

Compared to benchmarks done at the time of the release of xmlgrep, handling of this suboptimal case is about three times faster now.

Features

On the feature side, which is probably more interesting for casual would-be users, I have added grouping, Kleene-like operators, and subpattern negation, as well as reworked some of the old syntax.

Subpattern negation is the simplest extension, algthough it’s fairly powerful. It can be achieved because of backtracking so it’s only available on (look-ahead) subpatterns, not groups (see below). Negation lets you write something like {a}!{a/b} which selects all elements a that do not have a b element. This can’t be done with regular patterns because a node can have many children, and thus, unlike with regexes, something like a/!b (would-be syntax) would match some a node with a child which is not a b, but there can be many other children.

For example:

$ xmlgrep -x '{dict}!{dict/@id}/key/.' test.plist

will extract only keys from unnamed dictionaries.

The second addition is Kleene-like operators. Well, sorry about the name, but it’s just what comes to mind when I use these. They are, as the name implies, somewhat like their regex counter part, the Kleene star, in that they match zero or more repetitions of the preceding predicate. E.g. a*%b matches <a/><a/><a/>...<b/> but not, say, <a/><c/><d/>...<b/>. The %% and // operators have now become special cases of the Kleene operators. a%%b can be written a%**%b and a//b is really a/**/b, where the first star stands for "any node" and the second is part of the operator.

But having these repetition operators limited to repeating a single node predicate would have been quite boring, and so groups were born. Groups, written inside parentheses () (notice these have replaced the old subpattern syntax which has been moved back to being {} from its first days), well, group a pattern fragment, so that it can be repeated. They also accept alternation so you can write a%(b|c)*%d, which just seems natural.

Let’s take an example:

$ cat test.xml
<B/><a/><b/><a/><b/><a/><b/><a/><b/><a/><b/><E/>
$ xmlgrep 'B%(a|b)*%E' test.xml
<E/>

Groups are implemented without backtracking so using a group is not much more costly than using plain forward patterns, and is faster than using look-ahead subpatterns.

Yet, that is only true of normal groups. When you use a selection operator (+, #, - or =) on a group, it becomes a capturing group, a.k.a. look-ahead group. A look-ahead group is really a look-ahead subpattern combined with a group and has a somewhat subtle behavior. It behaves like a group, with the following difference: any selection operator inside a look-ahead group will only apply if the whole group matches. This can be used to bind selectors together. Thus we can revisit the classical key/value pair example: how do we match a key/pair with some properties on the key and some other on the value? You can always do that with simple look-ahead subpatterns combined with forward patterns, but groups just make it more convenient.

Imagine you want to match all key/value pairs where the key contains the letter a and the value is a string. Easy, with a capturing group:

$ xmlgrep -x '(key[.~"a"]#%string#)+' test.plist

Other components, reusability

Some people have discussed with me the possibility of making the library reusable. As it is now, it is not yet ready for wider exposure, but a lot of work has gone into rationalizing its API, in order to prepare for that. The API needs to be stabilized and the various components need to be better segregated.

Currently, the data structures in use are tightly integrated. For instance, the parser actually builds a tree with nodes able to hold matching states. Somehow, I would need to break that dependency, so that the matcher depends on the parser, but not the opposite, so you could use the XML parser as a stand-alone component (it has a nice hybrid design). Other parts are less affected, though.

What’s left to do

Now for the bitter part. Obviously, there are still a lot of things left to do, and I intend to continue working on this project, but whether it will be integrated into NetBSD at some point in the future is not for me to decide, though I really hope it will.

In short, here is the plan:

  1. write support for xmlsed templates in the library;
  2. write xmlsed proper;
  3. change the/add a high-level API to the library to make it possible to write xmljoin (because the high level framework only accepts one input file at this point in time);
  4. see what more needs to be adjusted for xmljoin to fit into the framework;
  5. write xmljoin proper;
  6. see what needs to be done for xmlsort to fit into the framework;
  7. write xmlsort.

Of course, I have given some thoughts to the "see what needs to be done" phases, but now is not right time to talk about that.

Also, at some point, I’ll need to address several shortcomings of the current code base:

  • add locale support;
  • convert my bunch of "regression shell scripts" to proper test suites (ATF, probably);
  • add non-NetBSD build support (currently, I have to hand-compile it on my Linux box).

And I will also want to:

  • clean-up the API and release the core components as a stand-alone library;
  • write documentation on the internals to accompany that library;
  • and of course, integrate it into NetBSD, if I am given permission. :)

Lastly, about xmlsed, though you can see no xmlsed directory in the repository, I can say it is pretty close to completion since the pieces needed are almost all there in the library (the only one missing is the template support, and that’s not the hard part).

Conclusion

The Google Summer of Code is nearing its end, but my summer vacations are not over yet. I won’t have classes until September, so I plan to carry on as if nothing much happened at least till the end of August. After that, I won’t have nearly as much free time, but I’ll do what I can; let’s not be hasty and wait till then to see how far the project has gone.

I will continue to update this blog and the code will be available at my Git repository, as usual. (Anyway, I have just recently added comments to my blog so feel free to drop by and give me your opinion.)

[ tag:blog.huoc.org,2009:posts/21 ]
No comments · Comment

/\ \/

xmltools implementation: automata and backtracking

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 07/20 2009 at 12:39 AM. No comments.
<. xmltools update: new I/O layer and further plans
>. xmltools: what's done, what's left
backtracking nfa parsing xmltools

At the moment, I’m implementing submatches in patterns, in preparation for xmlsed. Since most of what I do draws from formal languages and automata (except I’m too lazy to write anything formal), I’ve been looking into TNFAs for inspiration. Seeing how TRE seems to implement look-ahead without backtracking, I began to ask myself if such a thing could be done for my pattern matcher. Unfortunately, I can’t seem to figure out a solution by myself, so I’m blogging instead. :)

Basically, the problem is with constructs such as a[b]/c which match c, child of a only if a has a child b. However, when we reach a, we don’t know yet whether it has a child b or not, so we’re in need of some look-ahead.

This is akin to a(?=b)c in Perl regexes, but as you can already see, it’s fairly different. Perhaps the most obvious difference is that a tree has "two dimensions" while a string has only one, so it’s possible for a to have two children b and c, in a tree, while it’s not possible for a to be followed by two different characters, in a string.

Backtracking

The easiest way to solve this problem is to use backtracking. This is how the matcher is currently implemented. Even though about everything else is based solely on states and transitions, the part implementing look-ahead predicates uses backtracking.

The algorithm looks like this: everytime we have to match a look-ahead predicate, we stop, read as many nodes as we need to decide and then either drop the state or keep it, depending on the outcome.

This is a simple approach and it works well (i.e. it gives the expected results). Only, it’s slow. As a comparison, using the patterns I’ve used before for my benchmarks, the difference between hw/.="Ab\"a*cus" and p[hw/.="Ab\"a*cus"]# is the second one is about 2.66 times slower. This can be hand-optimized by rewriting the second pattern as body/p[hw/.="Ab\"a*cus"]#, with knowledge of the input format, where p can only appear as a child of body. This cuts the slowness to a mere 1.62 times slower.

At the same time, it demonstrates a major source of inefficiency in my backtracking implementation: loose (non-anchored) subpatterns, which may potentially apply to a great number of nodes yet would fail immediately for most of them, result in a lot of unwanted short backtracking segments where the matcher tries the subpattern and then immediately backtracks after failing, thus still examining the node twice.

Product automata

In theory, a pattern such as a(?=b)c matches if a is followed by both b and c so another natural idea would be to use a product automaton which transitions between couples of states: if we end up in two final states, then we win.

Only, this works well because we want to know whether the string matches as a whole. However, with an XML pattern such as a[b]/c, we need to identify which node matched the c part. This is where things start to get complicated.

Now we do know how to extract a submatch from a string with tagged transitions. So, great, does this apply to our XML matcher? The answer is "no", as far as I know (I’d be glad if someone proved me wrong, though). Tagged transitions are good at extracting one submatch, the last one, but in our case, the problem is we want all occurrences of c, not just the last one.

Keeping this information globally associated with the state is not an option since there may well be multiple instances of the look-ahead predicate trying to match at the same time. But trying to differentiate between two states based on the matching position is not an option either, because that could easily lead to linear memory usage. Consider, for example, the following XML fragment:

<a/><a/><a/><a/><a/>...<b/>

Trying to match {a%%b} (or (a%%b) using the current syntax in the master branch; sorry, I reverted to the old syntax because () will be used for groups) against that would result in a match at every a element. Hence, keeping the position as a state property is not a solution, since we’d have to distinguish between as many a nodes as there are in the input data.

Optimizing the backtracking method

Although the implementation of the matcher has undergone many changes in order to improve performances recently, mostly so as to accommodate new features while keeping a reasonable running time, I believe there’s still room for much improvement, maybe in the form of algorithmic enhancements rather than implementation optimizations.

Some of my more realistic ideas are:

  • Read one element ahead to reject some subpatterns without entering a backtracking context.

  • Try to predict the outcome of a look-ahead predicate and compute the more likely transitions along with the predicate itself, saving on backtracking (except to prune the tree) in case we’re right (but incuring additional recomputations if we’re wrong).

Also, this is not directly related to the backtracking approach, but I’m thinking of putting more information in the state cache; at the moment, we’re only caching "raw" transitions which account for the old state set and the direction (i.e. successor or child); we could try to cache local information as well, which would move us one step closer to the way NFA transition functions get cached to get DFAs.

Conclusion: do we need performance that badly?

Well, that’s a good question. And the answer is "we probably don’t". This is kind of my hobby, so don’t get the wrong idea. The whole thing is not that slow. On my simple example, xmlgrep was only 0.5s slower than libxml-powered xmlstarlet, taking less than six seconds to look up a word definition in a 58M dictionary, on my 2x2GHz Core 2, while using 100 times less memory.

I doubt anybody is going to use xmlgrep to look up words in dictionaries anytime soon; such a specialized task would best be served using an indexed collection of some sort, although I should mention that there are XML formats which are more suited to being parsed using my tools than others, so converting to a more appropriate (e.g. annotated) XML format using xmlsed then searching with xmlgrep could well be a viable solution too (given the conversion is one-time or infrequent compared to searches). More on that once xmlsed is out!

[ tag:blog.huoc.org,2009:posts/16 ]
No comments · Comment

/\ \/

xmltools update: new I/O layer and further plans

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 07/08 2009 at 03:45 PM. No comments.
<. xmlsed preview: writing XML
>. xmltools implementation: automata and backtracking
io xml xmltools

It has been nearly two weeks since my last update on my plans for xmlsed. Since that time, David and I have come to an agreement on the (hypothetical) syntax and my hybrid model has been retained.

Rewriting the I/O module

Although I now have a fair idea of what xmlsed should look like and how to implement it, some elements introduced after the recent discussions with my mentor required changes to the existing code.

In particular, the new template syntax required an XML parser able to handle partial XML documents. Besides, I wanted some syntactic sugar on top of XML to make writing templates easier. Expat being a well-behaved XML parser is quite strict about the syntax and there is no easy way to work around that. Though I did spend some days trying, at first, I eventually gave up, as it proved hard to write, let alone maintain in the future.

At the same time, I began to realize the original I/O abstraction I designed was showing its limits. I initially wanted to be able to read and write multiple tree representation formats, but in the end, the need to fully support XML, with all its specifics (doctypes, PIs, etc.), has convinced me to drop the idea and focus on XML alone.

All these reasons led me to rewrite the I/O layer (almost) completely. This work is being committed to yet another temporary branch: xmlpush.

Maybe the most visible change is that multi-backend support was dropped and the I/O module now consists of only two drivers: the Expat-based parser (the strict parser) and a home-made loose parser.

All tools now support two modes: the strict mode (-s flag) and the default loose mode.

  • In the strict mode, compliance with the XML standard is a priority: all extensions are disabled, including multi-root and partial documents (which was implemented with Expat as an ugly kludge before, so it was removed), the XML prolog is honored (the specified encoding is used and entity declarations are parsed), and stricter rules are enforced (e.g. in names).

  • In the loose mode, extensions are supported and the XML prolog is ignored (instead, we use the system locale).

The encoding issue

Although I’ve just stated how encodings will be handled, at the moment, I have not written any code to suppor that yet. Actually, it’s not a simple matter.

First of all, the XML standard mandates support of Unicode. For one thing, XML character entities (&xxxx;) are references to character codes from the Unicode tables. This makes support for Unicode pretty much mandatory.

In order to handle this, the Expat people have chosen to serve all data as UTF-8 (or UTF-16, depending on the build-time configuration), no matter what input encoding is used. This should have given you a hint: Expat has its own encoding conversion engine.

Now, even though XML documents can specify an encoding, and so it would be alright to always use UTF-8, that is not an acceptable solution in the real world: Unix users expect their programs to comply to the current locale. But Expat does not care one way or the other about, or integrate with, the locale system.

Fortunately, NetBSD has iconv(3) (though neither FreeBSD nor OpenBSD does, apparently, which will be a problem in making my programs portable; there seems to be an ongoing GSoC effort to port NetBSD libiconv to FreeBSD though), so I plan to run everything through that on input and again on output. Sounds like a waste of resources? Well, don’t blame me.

However, that’s not all: there is also the loose parser. Since this one was written by me, and I had no reason to force UTF-8 everywhere, it does no conversion and assumes all sources are in the current locale. The only problem will be for character entities (not currently implemented), but I intend to localize the Unicode-to-locale transformation to these, since it is costly.

Remember the push-style vs event-driven parser debate?

Well, it wasn’t really a debate, but maybe some of you remember that I posted some weeks ago a ticket on this blog about how I grew dissatisfied with the rigid event-driven behavior of Expat and wanted a push-style parser.

I took the occasion this time to make my wish come true (almost) and the new I/O design is mostly push-style. I say mostly because I’ve made some compromises in order not to impair performances.

At first, I thought about having the parser run on a fragment of input text and build a queue of events to be fed to the application one by one, on a on-demand basis. But then I realized I could just use the internal tree directly as my event queue, and have the application read that. Since we have support for look-ahead in the matching engine (and hence need to keep a partial internal tree in any case) and most tools will use that, this effectively moved the tree building code from the matcher to the I/O module, at the same time eliminating direct event responses (a bit less than 500 lines of code). This last point needs some clarification: sure we do use a little more memory (but honestly that just doesn’t show) since we build the tree first and only then process it, but this is limited to how many nodes can be represented in one read buffer, which means it’s mostly insignificant. But we have gained what I think is far more important: an unified implementation of the matcher (which is the most sensitive part).

Further plans

At the moment, the new parser does not support things such as entity declarations and I am still pondering whether to write code for that or not. In any case, it would be a good idea to have a xmlcat(1) utility which fully supports XML, including external entities, with the ability to fetch and include external documents (fetch(3)?) and substitute references. But let’s leave this discussion for another time.

As for the locale support, I think it will come after xmlsed is somewhat ready (i.e. as xmlgrep is right now). So for now, some more testing needs to be done on the new I/O components, and when this is done, I will resume work on xmlsed proper.

[ tag:blog.huoc.org,2009:posts/15 ]
No comments · Comment

/\ \/

xmlsed preview: writing XML

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/25 2009 at 12:48 PM. No comments.
<. More benchmarks: the bitvopt branch
>. xmltools update: new I/O layer and further plans
templates xmlsed xmltools

One might be tempted to think that, while parsing can be a complex task involving various competing models, printing is straightforward and should be the easy part of the job. This is not quite true however, as David and I have been in disagreement for some time already over the exact semantics and syntax of a potential XML printing subsystem of xmltools.

This post is all about the design of the printing language in xmltools and the underlying model. I also present at the end my latest hybrid proposal, that I’ve submitted to David for approval earlier today, and which I hope will bring satisfaction to all parties. If you have some opinion on this topic, do not hesitate to e-mail me! (My e-mail is my first name followed by my last name with all spaces replaced by dots; it’s written in French in the footer so you might not be able to decode it.)

The basic problem can be described as follows: we have a source tree, the input document and we want to transform it by adding or removing nodes in some places. Simple? Not quite.

Select & transform

The first design I proposed is based on sed(1): we select some node in the tree using a tree pattern (same as in my xmlgrep) and then transform it using some actions: insert, append, replace, etc. Only these needed to be adapted to XML.

My idea was then to use PYX/ESIS-inspired actions, which also happen to model closely my internal API. These consisted of a one-character prefix indicating the thing to print, followed by contents (depending on the prefix). Prefixes were:

( opening tag
) closing tag
@ attribute
. text
! comment
? junk

This is a very simple approach but David objected that it was unnatural to write XML using some language which was not XML.

Advantages
  • Very simple and straightforward.
  • Completely symmetrical to the way patterns are matched.
Disadvantages
  • Code used to write XML does not look like the output.
  • Not intuitive to read as data.

Decorate & pull

Another approach, which is used most notably by XSLT, is to have a template, which itself is written as XML, in which some elements from the source document are pulled.

Let me say it upfront in case anybody is not clear on this: pulling is not a stream technique, so this model as is cannot be implemented easily (some subsets of XSLT as stream transformations have been researched, but since it is from the start not a stream-oriented technique, it’s complex and IMHO not worth it).

David first suggested using a template as a way of having printing-heavy script actually look like their output. He observed plain XML could be written directly while some special elements, e.g. xmlsed:addattr could be used to print things that cannot be represented directly in XML. Indeed, there is no XML syntax to write just an attribute without the surrounding tag, yet there are cases where we might want to add an attribute to an existing element.

I criticized this idea based on the fact that reserving a namespace was intrusive and that using XML to represent scripting elements was too verbose.

But the main point was that pull-style transformation was never an option.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
Disadvantages
  • Model is not adapted to streaming.
  • Too verbose.

Towards a hybrid model: XML syntax for select & transform

Given the stream-unfriendly nature of the template-based model, David came up with the idea to represent actions of the select & transform model as XML: XML opening and closing tags could be used to represent the corresponding printing actions (respectively ( and )) but for the rest, we were still stuck with the way-too-verbose XSLT-like <xmlsed:whatever> elements.

I was still very dissatisfied with the verbosity of having to write <xmlsed:whatever/> just to add an attribute or specify where to insert the current node (before, after, inside the plain XML part), but the concept itself, of using a template, was quite nice.

Advantages
  • XML in script remains in output.
  • Very intuitive and easy to read.
  • Still reasonably simple to implement.
Disadvantages
  • Still too verbose.

A hybrid model

Today, after more brainstorming, I finally decided on the following model, and unless David is really against it (which I seriously doubt), I think it will make it into xmlsed one way or the other: the idea is the same as that of the previous attempt, use a template which actually describes actions inferred from the structure of XML itself.

Only, this time, we do not use special elements to indicate actions anymore, everything is inferred from the structure. The trick to do this is to have a virtual element represent the current node; it can have any name (it will be user-settable), let’s call it : by default. We can place it wherever we want in the template. And to add an attribute (resp. a child), we just have to add an attribute (resp. a child) to that virtual element. The syntax remains reasonably compact (even optimally compact given we’re using XML as the basis for our template) and the script now very closely models the output, even more than XSLT scripts!

The idea was inspired from xargs(1) and its -I option. Again, Unix saved the day… well, not quite, but still, I think it is a nice and practical compromise. :)

[ tag:blog.huoc.org,2009:posts/10 ]
No comments · Comment

/\

More benchmarks: the bitvopt branch

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/21 2009 at 11:28 AM. No comments.
<. xmlgrep by examples: playing with Atom
>. xmlsed preview: writing XML
benchmarks gcc icc performance vectorization xmltools

Those who have been keeping an eye on my Git repository have surely noticed the recent update of the bitvopt branch, which contains alternate implementations of the core routines of the matching engine (match.c): followsibling and followparent. The bitvopt code replaces the direct tests and propagation entirely with bit vector operations (masking, shifting, etc.).

The main advantage of this approach is vectorization. Such operations are easily done on a single unit (the type of which is defined in bitv.h and somewhat configurable with the USE_STDINT macro) and can be generalized to loops that carry no dependencies between iterations. These loops can then be auto-vectorized by a sufficiently smart compiler. The current code in bitvopt is understood and vectorized by both GCC 4 with -ftree-vectorize and ICC 10.1. Beware, ICC 10.0 fails because of the restrict qualifiers, which are necessary; also note that, on x86, you have to pass USE_STDINT in order to select a type bigger than unsigned char, or else, SIMD shift operations won’t be available. To see the vectorization reports, you can use -ftree-vectorizer-verbose=5 for GCC and -vec-report for ICC.

The results are not quite satisfactory and the branch was not merged into master:

GCC auto-vectorization not doing well…

On the chart, GCC has vectorization disabled by default while ICC always has vectorization (even though I have not specified -vect anywhere). The measurements have been done on my old Pentium 4, with -O2 and appropriate architecture flags, with GCC 4.1.2 and ICC 10.1. The values are averages computed over 100 runs of the following pattern on the same dictionary as before:

$ xmlgrep 'p[hw/?="Ab\"a*cus"]#'

I must say I can’t explain why operations on native integers (the -int builds) are slower than those on bytes. As for the relative gain of non-vectorized code over vectorized one, this is most probably because the pattern is not long enough (less than one base unit) to take advantage of SIMD so we lose in the overhead introduced with vectorization (all the more so as data may be unaligned on entry of these functions). Lastly, it seems auto-vectorization in GCC really is not advantageous right now, since the non-vectorized versions tend to outperform their vectorized counterparts when compiled with GCC (and this is not the case with ICC).

Anyway, this was just a side experiment of mine; work on the main tools continue: plans for more features in the matcher (on request, case insensitivity, arithmetic predicates, maybe some other things) as well as the introduction of xmlsed are underway.

[ tag:blog.huoc.org,2009:posts/9 ]
No comments · Comment

>> Page: 0 1