Working with huge XML files - tools of the trade.

XMLStarlet is great for slicing and dicing huge XML files. I had a run-in recently with an 80 MB XML file that was all on a single line :D. Guess what, most editors that I tried balked and fell over. This was on a 2Gig Core2 Duo machine.

XMLSpy, vi, emacs and Notepad++ all died - and trying to do anything with an 80 MB XML file where all 80 MB sit on a single line isn't much fun. So the first order of business was to pretty-print the XML. XMLStarlet worked great -

xmlstarlet fo file.xml > output.xml

and you’re done.

The next order of business was to validate the XML document against a schema. Our first attempt was with Sun's Multi-Schema Validator (MSV). MSV does not validate the whole document; it stops after a certain number of failures. So, MSV - out, XMLStarlet in. XMLStarlet can validate documents against a W3C XML Schema, a DTD or a RELAX NG schema.

xmlstarlet val --err --xsd schema.xsd input.xml > errors.txt 2>&1

And presto! - you get an error report that you can slice and dice with sed/awk or anything else at all.
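As a rough idea of what that slicing and dicing can look like, here is a minimal sketch that counts errors per element name. The sample lines and the file name sample-errors.txt are made up for illustration, and the sketch assumes each error line contains the text Element 'name'; check a few lines of your real errors.txt first, since the exact message format may differ.

```shell
#!/bin/sh
# Hypothetical sample report, standing in for the real errors.txt.
# The line format here is an assumption, not captured tool output.
cat > sample-errors.txt <<'EOF'
input.xml:12.8: Element 'price': 'abc' is not a valid value of the atomic type 'xs:decimal'.
input.xml:40.8: Element 'price': 'n/a' is not a valid value of the atomic type 'xs:decimal'.
input.xml:57.10: Element 'isbn': This element is not expected.
EOF

# Pull out the quoted name that follows "Element", then tally the
# duplicates and list the most frequent offenders first.
sed -n "s/.*Element '\([^']*\)'.*/\1/p" sample-errors.txt \
  | sort | uniq -c | sort -rn > summary.txt

cat summary.txt
```

Pointing the sed line at the real errors.txt gives a quick picture of which elements cause most of the trouble, which is usually more useful than reading thousands of lines top to bottom.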

XMLStarlet also lets you write XPath expressions to query the XML - however, I found the syntax too weird and roundabout. A better alternative is a Perl-based solution - XSH2 - a command-line XML editing shell. You can install it under Cygwin, and it supports basic command pipelining and redirection.
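For comparison, before moving on to XSH2, this is roughly what the XMLStarlet route looks like. It is only a sketch: sample.xml and its catalog/book structure are invented for illustration, so substitute your own file and paths.

```shell
#!/bin/sh
# Hypothetical sample document, standing in for the real file.
cat > sample.xml <<'EOF'
<catalog>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>
EOF

# -t starts a template; -v prints the string value of an XPath
# expression and -n adds a newline.
xmlstarlet sel -t -v "count(/catalog/book)" -n sample.xml

# -m loops over the matching nodes; the -v that follows is evaluated
# relative to each match.
xmlstarlet sel -t -m "/catalog/book" -v "title" -n sample.xml

# -c copies the matching nodes themselves, markup included, which is
# one way to save a fragment to a file.
xmlstarlet sel -t -c "/catalog/book[@id='b2']" -n sample.xml > fragment.xml
```

It works, but every query has to be assembled from template flags, which is the roundabout part.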

So go ahead and launch XSH2 at your Cygwin prompt -

[~]xsh
---------------------------------------
 xsh - XML Editing Shell version 2.1.1
---------------------------------------

Copyright (c) 2002 Petr Pajas.
This is free software, you may use it and distribute it under
either the GNU GPL Version 2, or under the Perl Artistic License.
Using terminal type: Term::ReadLine::Gnu
Hint: Type `help' or `help | less' to get more help.
$scratch/>

Now, let's load up our pretty-printed document (saved here as formatted.xml). Type

$scratch/>$x:=open formatted.xml

Your prompt changes to

$x/>

So go ahead and try a few XPath expressions

$x/> ls /path/to/node

and XSH2 prints out the matching nodes. Now what if you need to create a document fragment from the nodes matching a certain XPath? Piece of cake - go ahead

$x/> ls /path/to/node | tee fragment.xml

XSH2 has many, many more features - but this should be good enough to get you off the ground.