Edited 22.06. Incompatible syntax changes and more examples.
Matthew Sporleder (implicitly) suggested on tech-userlevel that xmlgrep could be used to dissect Atom feeds, but was a bit lost as to how exactly to do it, so I thought I’d post a little demo of a console session playing with an Atom file and xmlgrep (as well as some other command-line tools).
First, let’s fetch the feed:
$ ftp http://mspo.com/blog/atom.xml
[...]
Since xmlgrep indents its output as a side effect, we can first use it
to make the XML file human-readable:
$ xmlgrep '*' atom.xml | more
List all posts in the NetBSD category with their IDs:
$ xmlgrep -x 'entry[category/@term=NetBSD]/(title|id)/.' atom.xml
tag:blogger.com,1999:blog-6347225410141611306.post-1131641169617411392
NetBSD quotas - quickstart
tag:blogger.com,1999:blog-6347225410141611306.post-1939815769827620970
NetBSD device drivers - easier than you might think
[...]
Those of you not yet familiar with the syntax might have some trouble
understanding. The previous pattern could be read "select a text child
of an id or title element, itself a child of an entry element, which
contains a category element which has an attribute child term equal to
NetBSD." Step by step, you should notice that a[b] is read "a such
that b", | stands for "or", / for "child of", @ for "attribute",
. for "text", and the parentheses are used for grouping purposes.
Now, let’s select a post by ID:
$ xmlgrep -x 'entry[id/.~"post-1939"]#' atom.xml
Or, select a post by title and view its contents using w3m:
$ xmlgrep -x 'entry[title/.~"device drivers"]/content/.' atom.xml |
> sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' |
> w3m -T text/html
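To see what the sed stage does in isolation, here is a minimal sketch. It assumes the content element holds escaped HTML, so that the text xmlgrep prints still contains &lt;, &gt; and &amp; entities. The function name unescape_xml and the sample string are made up for illustration. Note that &amp; is decoded last, so that a doubly escaped &amp;lt; does not wrongly turn into <.

```sh
# Decode the three XML entities that typically appear in escaped HTML.
# &amp; goes last so that "&amp;lt;" becomes "&lt;" rather than "<".
unescape_xml() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Made-up sample line, standing in for the text of a content element.
printf '%s\n' '&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;' | unescape_xml
```

This should print <p>Tom &amp; Jerry</p>, which is plain HTML again and ready to be handed to w3m.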
As a side note, I should mention that up to now we have used
subpatterns quite a lot. This is because the Atom feed specification
does not force an order (or does it?) on the children of entry
elements. With more precise knowledge of the order of elements
relative to each other, we could have optimized the pattern to use %
and %% where possible. Subpatterns are costly, but for data sets
of this size, we probably don’t care much.
Let’s print all entry titles which date from March 2009 using the fact
that we know the updated element comes before the title one:
$ xmlgrep -x 'entry/updated[.~"^2009-03"]%%title/.' atom.xml
A friend of mine told me it would be useful to have arithmetic
predicates. I think they will feature in xmltools sooner or later, but
even without them, it is still possible to do some simple statistics,
by combining the results with awk(1), for example. The following
one-liner counts the number of posts that are no older than March:
$ xmlgrep -x 'entry/updated/.' atom.xml | awk -F - '$2>=3' | wc -l
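The one-liner above compares only the month field, which is fine as long as every entry in the feed dates from the same year. As a hedged sketch of a year-aware variant, the function below assumes one timestamp per line, starting with YYYY-MM, and lets awk do the counting itself. The name count_since is made up, and the timestamps are invented sample data standing in for the xmlgrep output.

```sh
# Count timestamps that are no older than a given year and month.
# Usage: count_since YEAR MONTH, with one timestamp per line on stdin.
count_since() {
    awk -F - -v y="$1" -v m="$2" '
        BEGIN { y += 0; m += 0 }                  # force numeric comparison
        $1 > y || ($1 == y && $2 >= m) { n++ }    # year first, then month
        END { print n + 0 }                       # print 0 on empty input
    '
}

# Invented sample timestamps (not taken from the real feed).
printf '%s\n' \
    '2008-11-02T10:00:00Z' \
    '2009-02-14T08:30:00Z' \
    '2009-03-01T12:00:00Z' \
    '2009-04-20T09:15:00Z' |
count_since 2009 3
```

With the sample input, this should print 2; the month-only test would also have counted the November 2008 entry. On the real feed, the same function could be fed from the xmlgrep command shown earlier in place of the printf.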
That’s it; I hope this will help people who want to get started with xmlgrep. If you have other good examples you’d like me to elaborate on, do not hesitate to send me a mail!