[Talk] [Atom] [Talk Atom]    [Git] [Mail] [Reload]
Links: mdown · hacks · misc · cabal · about +
Today’s menu
\/

More benchmarks: the bitvopt branch

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/21 2009 at 11:28 AM. No comments.
<. xmlgrep by examples: playing with Atom
>. xmlsed preview: writing XML
benchmarks gcc icc performance vectorization xmltools

Those who have been keeping an eye on my Git repository have surely noticed the recent update of the bitvopt branch, which contains alternate implementations of the core routines of the matching engine (match.c): followsibling and followparent. The bitvopt code replaces the direct tests and propagation entirely with bit vector operations (masking, shifting, etc.).

The main advantage of this approach is vectorization. Such operations are easily done on a single unit (the type of which is defined in bitv.h and somewhat configurable with the USE_STDINT macro) and can be generalized to loops that carry no dependencies between iterations. These loops can then be auto-vectorized by a sufficiently smart compiler. The current code in bitvopt is understood and vectorized by both GCC 4 with -ftree-vectorize and ICC 10.1. Beware, ICC 10.0 fails because of the restrict qualifiers, which are necessary; also note that, on x86, you have to pass USE_STDINT in order to select a type bigger than unsigned char, or else, SIMD shift operations won’t be available. To see the vectorization reports, you can use -ftree-vectorizer-verbose=5 for GCC and -vec-report for ICC.

The results are not quite satisfactory and the branch was not merged into master:

GCC auto-vectorization not doing well…

On the chart, GCC has vectorization disabled by default while ICC always has vectorization (even though I have not specified -vect anywhere). The measurements have been done on my old Pentium 4, with -O2 and appropriate architecture flags, with GCC 4.1.2 and ICC 10.1. The values are averages computed over 100 runs of the following pattern on the same dictionary as before:

$ xmlgrep 'p[hw/?="Ab\"a*cus"]#'

I must say I can’t explain why operations on native integers (the -int builds) are slower than those on bytes. As for the relative gain of non-vectorized code over vectorized one, this is most probably because the pattern is not long enough (less than one base unit) to take advantage of SIMD so we lose in the overhead introduced with vectorization (all the more so as data may be unaligned on entry of these functions). Lastly, it seems auto-vectorization in GCC really is not advantageous right now, since the non-vectorized versions tend to outperform their vectorized counterparts when compiled with GCC (and this is not the case with ICC).

Anyway, this was just a side experiment of mine; work on the main tools continue: plans for more features in the matcher (on request, case insensitivity, arithmetic predicates, maybe some other things) as well as the introduction of xmlsed are underway.

[ tag:blog.huoc.org,2009:posts/9 ]
No comments · Comment

/\ \/

xmlgrep by examples: playing with Atom

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Updated on 06/22 2009 at 02:30 PM. No comments.
<. Improving performances of xmlgrep
>. More benchmarks: the bitvopt branch
atom tutorial xml xmltools

Edited 22.06. Incompatible syntax changes and more examples.

Since Matthew Sporleder on tech-userlevel has (implicitly) suggested that xmlgrep could be used to dissect Atom feeds, but was a bit lost at how to do it exactly, I thought I’d post a little demo of a console session playing with an Atom file and xmlgrep (as well as some other command-line tools).

First, let’s fetch the feed:

$ ftp http://mspo.com/blog/atom.xml
[...]
As a side effect of xmlgrep, we might want to indent the XML file to make it human-readable:
$ xmlgrep '*' atom.xml | more
List all posts in the NetBSD category with their IDs:
$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]

Those of you yet unfamiliar with the syntax might have some trouble understanding. The previous pattern could be read "select a text child of an id or title element, itself a child of an entry element, which contains a category element which has an attribute child term equal to NetBSD." Step by step, you should notice that a[b] is read "a such that b", | stands for "or", / for "child of", @ for "attribute", . for "text", and the braces are used for grouping purposes.

Now, let’s select a post by ID:

$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml
Or, select a post by title and view its contents using w3m:
$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html

As a side note, I should mention that up to now we have used subpatterns quite a lot. This is because the Atom feed specification does not force an order (or does it?) on the children of entry elements. With more precise knowledge of the order of elements relative to each other, we could have optimized the pattern to use % and %% where possible. Subpatterns are costly, but for data sets this size, we probably don’t care much.

Let’s print all entry titles which date from March 2009 using the fact that we know the updated element comes before the title one:

$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml
A friend of mine told me it would be useful to have arithmetic predicates. I think they will feature in xmltools sooner or later, but even without them, it is still possible to do some simple statistics, by combining the results with awk(1), for example. The following one-liner counts the number of posts that have no older than March:
$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l

That’s it; I hope this will help people who want to get started with xmlgrep. If you have other good examples you’d want me to elaborate, do not hesitate to send me a mail!

[ tag:blog.huoc.org,2009:posts/8 ]
No comments · Comment

/\ \/

Improving performances of xmlgrep

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/16 2009 at 12:28 PM. No comments.
<. xmlgrep toy benchmarks
>. xmlgrep by examples: playing with Atom
memory performance vectorization xmltools

After spending some days fixing bugs in the xmlgrep matcher code, I took some time again to look at the performance issues. The previous measurements showed xmlgrep was slower than xmlstarlet, even on patterns without subpatterns, which are theoretically optimal for my implementation. That really bothered me.

Profiling seemed to show xmlgrep spent most of its time in bit vector operations – that much was expected. But my vain attempt (Git branch bitvopt) at replacing tests (the FOREACHVSTATE macro) with bit vector operations proved actually slower than the master version.

Then I had a talk with David Young, my mentor, and we concluded that both the input mechanism and the memory allocation pattern were suboptimal.

About the input problem, I’ve replaced the old input feeder, which used fgets(3), with direct read(2) calls, and this has indeed shaved some seconds off the counter. This is especially noticeable for patterns without subpatterns, since I/O accounts for more in that case.

The memory allocation issue was more serious. Basically, a lot of bit vectors and subcontext frames were needlessly allocated then almost immediately freed afterwards. This is due to the following construct commonly found in patterns (maybe I should optimize this one case specifically): patterns which start with an anchored subpattern, typically something like a[b] or {a%b}. This kind of patterns causes the matcher to try to match every node in the tree against the subpattern yet fail almost all the time (on all those nodes that do not match a). Such a failed subpattern match entails a call to pushsub followed by a call to popsub, which are responsible for the management of the subcontext stack.

I’ve switched to pool allocators for both internal tree nodes and bit vectors. This has drastically reduced the number of allocations made during a run. David also pointed out that I could use a pre-allocated vector instead of a linked list for the subcontext stack, and I remembered one of the nice properties of my implementation: the number of subcontext is bounded by the maximal nesting depth of subpatterns. He also suggested I should use copy-on-write bit vectors, but that proved unnecessary as I now use pointer swapping instead of copying then erasing (which was quite stupid in the first place, I agree).

All these changes were developed in the memopt2 branch overnight and are now in master. The same measurements as before show some exciting improvements, most notably the fact that we are now faster than libxml on patterns that do not use look-ahead! Basically, on this example, we are about 50% faster than before.

Revisited speed comparison
[ tag:blog.huoc.org,2009:posts/7 ]
No comments · Comment

/\ \/

xmlgrep toy benchmarks

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/12 2009 at 01:06 AM. No comments.
<. XML as a tree representation
>. Improving performances of xmlgrep
benchmarks memory performance xmlstarlet xmltools xmltwig

As xmlgrep is approaching a pre-pre-alpha release (something that is still very experimental, but nonetheless working, somewhat :), I’ve been doing some basic tests on it, including one simple benchmark. The results are, IMHO, encouraging. The benchmark pits xmlgrep against equivalent existing software:
  • the selection tool from the xmlstarlet project (available in pkgsrc as textproc/xmlstarlet);
  • another xmlgrep from the Perl Twig package (textproc/p5-XML-Twig in pkgsrc);
  • and GNU grep, though it does not really accomplish the same, from the base NetBSD-5 distribution, as a reference.

Since each tool accepts its own patterns and treats them differently, I’ve tried my best to get meaningful results with each, but obviously, the behaviors obtained differ somewhat.

The tests/memplot script was used to make the sampling. The data were plotted using Gnuplot. The following commands were run:

$ xmlgrep 'hw/?="Ab\"a*cus"' d.xml
$ xmlgrep 'hw[?="Ab\"a*cus"]' d.xml
$ xml sel -t -c "//hw[text()='Ab&quot;a*cus']" d.xml
$ xml_grep "hw[string()='Ab\"a*cus']" d.xml
$ grep 'Ab"a\*cus' d.xml

The goal was to retrieve the <hw> element which contains the Ab"a*cus string from a 53M dictionary, retrieved from http://www.ibiblio.org/webster/ and merged into a single file.

The first two lines correspond to two different invocations of xmlgrep, which do not do exactly the same thing, and with the first being the good one, the other inefficiently abusing the look-ahead mechanism. Yet, it was included for comparison purposes and maybe to be fair to the XPath-based alternatives, which have to rely on a predicate.

Results follow; the second and third pictures are just zooms which omit one or another of the candidates.

I know these results don’t mean much, so don’t take them too seriously. I’ve only tested a single simple pattern. Actually, I didn’t really intend to make a comparison, at first, I just wanted to make sure xmlgrep doesn’t leak or otherwise misuse memory. But since the sampling code was there, I thought it’d be fun to do some additional measurements.

Overall benchmark
Low-memory candidates benchmark
Fast candidates benchmark

As expected, xmlstarlet, which uses a full DOM model requires a lot of memory to do its work (more than 400M!). But it is the fastest.

My xmlgrep, when used right, is still nearly 42% slower than xmlstarlet (from above 4s to below 6s), but I think the times remain reasonable; memory-wise, it is also the lightest, with a constant 2.9M in use throughout the whole run. Even GNU grep requires about 2.8M.

Surprisingly, the Twig xmlgrep is not too big a memory killer, though its memory usage increases over time, by steps (though it looks linear on the long run). This appears somewhat strange to me; why some nodes should be pruned from its internal DOM-like tree while others seem to remain for the duration of the program. However, on the speed side, Twig is predictably very slow; it takes nearly two minutes to produce the results comparable to those of its C counterparts.

That’s it for tonight; as one friend of mine told me, more interesting benchmarks will probably come from actual users.

[ tag:blog.huoc.org,2009:posts/6 ]
No comments · Comment

/\ \/

XML as a tree representation

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/10 2009 at 03:33 PM. No comments.
<. In search of a good pull-style XML parser
>. xmlgrep toy benchmarks
trees xml xmltools

xmltools is a project whose ultimate goal is to bring Unix command-line efficiency to the XML world (and not, as many might think, to bring XML into the Unix world, which is being done already, though not by my fault). The idea is to treat XML as a generic tree representation format.

This seems very straightforward; in fact, one could argue that XML was designed from the start to be a generic tree representation. However, XML has a number of problems that make it hard to use in this role.

The first such problems is doctype declarations. While the bare XML structure, which we see most of the time, is very simple, doctype declarations and their semantics are both unneeded and unwanted in tree (syntax) manipulation programs. It would be fine if doctype support was optional for non-validating XML processors, but as I understand it, the XML standard mandates support for a portion of it: entity declarations and references, and default attribute values.

The second issue is with white space handling: XML does not include any default behavior regarding white space. XML parsers must pass all white space to the application, which is responsible for some sensible default processing. The xml:space attribute can give a hint to the application but as far as I know, it’s not in wide use.

This all led me to develop a set of compatibility rules which precisely describe what one should expect from xmltools, no matter which backend is used (in case we move away from expat, which I plan to do, eventually). This subset is sufficient to describe any tree-like data set.

The rules will be maintained in the doc/compatxml.text file, in the xmltools source tree. For convenience, I’ve reproduced the current version below:

  1. Do not use doctype declarations for any purpose other than validation. In particular, do not rely on doctype declarations to provide default attributes and do not use internal entities for arbitrary substitutions; if you wanted to save typing, you wouldn’t be using XML. Use of numeric and predefined entities is OK. Use of directly-encoded characters is preferred.

  2. Do not presuppose the XML processor has access to any resource beside your file: do not use external entities.

  3. Do not assume the XML processor is able to handle multiple character sets: all the documents and other data supplied to the program should use the same character set. Also, set your locale appropriately.

  4. Unless xml:space is set, assume all white-space only text elements are ignored by the XML processor.

  5. Set xml:space to preserve whenever white space not kept by the previous rule must be preserved. As stated in the first rule, do not rely on doctype declarations to implicitly set this attribute. Please note however that the fourth rule works well for most purposes, including processing of HTML pre elements which only contain verbatim text.

[ tag:blog.huoc.org,2009:posts/5 ]
No comments · Comment

/\

In search of a good pull-style XML parser

#. By rz0 in GSoC 2009 for NetBSD: xmltools. Posted on 06/09 2009 at 11:07 AM. No comments.
>. XML as a tree representation
api parsing xml xmltools

Recently, I’ve been working with expat extensively, as part of my Google Summer of Code project, and have come to the realization that there is, IMHO, no good well-maintained lightweight stream-oriented XML parser.

Of course, one would think expat, at this point, and I thought so too, when I started, only to find myself now locked with an expat-like design, which is both inefficient and unnecessarily complicated (for what I want to do).

So, what is this expat design I am talking about? It is the event-driven (callback-oriented) model, what the XML guys out there call push parsers, though I personally think it doesn’t make much sense to call it that way, so I won’t.

While event-driven parsing is good for some tasks — awk(1) is a really clever tool — it really fails on complex parsing tasks such as those required by my xmltools that rely on partial reprocessing of the document tree.

Reprocessing of the document tree basically means we have to keep an in-memory copy of the parts we want to traverse more than once. Of course, expat can (help us) do that. Using expat, the basic idea is to have two input sources which both call the event handlers. Then, whenever we need to reprocess some part of the tree, we just traverse our in-memory copy and feed the information to the handlers as if they were receiving these from expat.

Now, this seems good at first, but the fact is we now have two parsers: expat and our tree parser. Moreover, they don’t share an identical behavior: on the expat side, we may need to build new nodes to put in the partial tree (for future reprocessing), while in the in-memory tree parser, we may need to free some nodes which are no longer needed. But even in the expat handling code there may be nodes we have to keep (usually the current branch, from the root to the current node); this in turn implies we sometimes have to free some nodes as well. Now we have bits and pieces of the in-memory tree management code everywhere. Things start to look messy…

Besides, when do we call the surrogate parser? The issue here is we don’t want expat to feed new data while we are in the middle of (or more precisely before we can even start) a reprocessing operation. The expat manual points at the XML_StopParser and XML_ResumeParser functions. These seem to be exactly what we need, right? Wrong. Indeed, a more careful reading of the documentation reveals parsing does not stop immediately after the handler that called XML_StopParser; there may be an unspecified number of additional events between the end of that one and the moment the parser actually stops. This makes these routines pretty much useless. Actually, we need to use the secondary input system right on top of expat itself, in the middle of our callbacks, whenever we need this functionality. This is not pretty to say the least…

But there is a more elegant solution: the idea is to have the internal tree management module drive the process. It maintains the in-memory tree and calls the event code on its nodes whenever necessary, possibly going through loops or any patterns that suit its needs, for that matter. Then, when it requires a node that is not present in the tree, it calls upon the input module which returns the next node in the stream. This is the so-called pull method. There really is nothing special about it; in fact, it’s the way most I/O libraries out there are built. Why XML had to be different, I don’t know.

As I’ve said, to the extent of my knowledge, there is no lightweight and well-maintained (I mean not some two-year-old experiment by some obscure hacker, though I respect these things, I don’t have time right now to play with such codes) C library which implements the full XML standard and this pull metaphor. If anyone out there knows, do not hesitate to let me know.

[ tag:blog.huoc.org,2009:posts/3 ]
No comments · Comment

<< >> Page: 0 1 2