Reputation: 537

Extract text from particular elements of a large poorly formatted XML file

I have a large (~50 MB) file containing poorly formatted XML that describes documents and their properties between <item> </item> tags, and I want to extract the text from all English documents.

Python's standard XML parsing utilities (dom, sax, expat) choke on the bad formatting, and more forgiving libraries (sgmllib, BeautifulSoup) parse the entire file and take too long.

<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> .... </document>
</item>

Does anyone know a way to extract the text between <document> and </document> only when lang is en, without parsing the entire document?

Additional information: Why it's "poorly formatted"

Some of the documents contain a <dc:link></dc:link> element, and its undeclared dc: prefix causes problems with the parsers. Python's xml.dom.minidom complains:

ExpatError: unbound prefix: line 13, column 0

Upvotes: 1

Answers (4)

ghostdog74

Reputation: 342977

If you have gawk:

gawk 'BEGIN{
 RS="</item>"            # one record per <item> block
 startpat="<document>"
 endpat="</document>"
 lpat=length(startpat)
}
/<lang>en<\/lang>/{
    # locate the text between the document tags in this record
    start=index($0,startpat)
    end=index($0,endpat)
    if (start && end>start)
        print substr($0,start+lpat,end-(start+lpat))
}' file

Output:

$ more file
Junk
Junk
<item>
  <title>some title</title>
  <author>john doe</author>
  <lang>en</lang>
  <document> text
         i want blah ............  </document>
</item>
junk
junk
<item>
  <title>some title</title>
  <author>jane doe</author>
  <lang>ch</lang>
  <document> junk text
           ..       ............ </document>
</item>
junk
blahblah..
<item>
  <title>some title</title>
  <author>GI joe</author>
  <lang>en</lang>
  <document>  text i want ..... in one line  </document>
</item>
aksfh
aslkfj
dflkas

$ ./shell.sh
 text
         i want blah ............
  text i want ..... in one line
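Since the question is about Python, the same record-splitting idea can be sketched there too. This is only a rough equivalent of the gawk script above, not a parser: it assumes items never nest, that the language tag is written exactly as <lang>en</lang>, and that the file can be decoded as UTF-8. The names iter_records, english_documents and read_chunks are made up for this sketch.

```python
import re
from typing import Iterable, Iterator

ITEM_END = "</item>"
LANG_EN = "<lang>en</lang>"
DOCUMENT = re.compile(r"<document>(.*?)</document>", re.DOTALL)


def iter_records(chunks: Iterable[str], sep: str = ITEM_END) -> Iterator[str]:
    """Yield the text before each separator, like gawk with RS=sep."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        start = 0
        while True:
            end = buffer.find(sep, start)
            if end == -1:
                break
            yield buffer[start:end]
            start = end + len(sep)
        # Keep the unfinished tail; a separator split across chunks ends up here.
        buffer = buffer[start:]
    if buffer:
        yield buffer


def english_documents(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the <document> text of every record marked <lang>en</lang>."""
    for record in iter_records(chunks):
        if LANG_EN in record:
            match = DOCUMENT.search(record)
            if match:
                yield match.group(1)


def read_chunks(path: str, size: int = 1 << 20) -> Iterator[str]:
    """Read a file in fixed-size pieces so it is never fully in memory."""
    # Assumption: UTF-8 input; undecodable bytes are replaced, not rejected.
    with open(path, encoding="utf-8", errors="replace") as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                return
            yield chunk
```

Usage would be something like `for text in english_documents(read_chunks("file")): print(text)`. Like the gawk version, it ignores everything outside the tags it looks for, so the dc: prefix never matters.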

Upvotes: 1

vtd-xml-author

Reputation: 3377

I think that if you are OK with Java, VTD-XML would work without any issues from those undefined prefixes.

Upvotes: 0

Jim Garrison

Reputation: 86774

Depending on how (and how badly) the document is 'broken', it might be possible to write a simple filter in Perl or Python that fixes it enough to pass XML well-formedness checks, so that it can be loaded into a DOM or processed with XSLT.

Can you add some examples of what's wrong with the input?
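For the one problem the question does name (the unbound dc: prefix), such a filter might look like the sketch below in Python. It is a guess at a minimal repair, not a general fix: it rewrites prefixed tag names such as dc:link to dc_link so that a namespace-aware parser no longer needs a declaration for the prefix. The function name strip_tag_prefixes is invented here, prefixed attributes are not handled, and any other damage in the file would need its own rule.

```python
import re
import xml.etree.ElementTree as ET

# Matches the start of an opening or closing tag with a prefixed name,
# e.g. "<dc:link" or "</dc:link".
PREFIXED_TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)")


def strip_tag_prefixes(text: str) -> str:
    """Rewrite <prefix:name ...> as <prefix_name ...> (tag names only)."""
    return PREFIXED_TAG.sub(r"<\1\2_\3", text)


# Example on a single item; a real run could apply the filter line by line,
# assuming no tag name is split across lines.
broken = (
    "<item><lang>en</lang>"
    "<dc:link>http://example.com</dc:link>"
    "<document>hello</document></item>"
)
item = ET.fromstring(strip_tag_prefixes(broken))
print(item.findtext("lang"), item.findtext("document"))
```

Declaring the missing namespace on the root element would be the cleaner repair, but that requires knowing what the root looks like, which the question does not show.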

Upvotes: 0

Rubens Farias

Reputation: 57996

You'll need an event-oriented parser, such as SAX or, in .NET, System.Xml.XmlReader.
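In Python, the event-oriented approach could be sketched with the standard xml.sax module as below. The handler class EnglishDocuments is made up for illustration, and it assumes each item holds one lang and one document element. The question reports that SAX choked on the real file, so this only helps if the input is well-formed enough. As far as I know, xml.sax leaves namespace processing off by default and then reports a name like dc:link as a plain string instead of rejecting it, but treat that as something to verify against your data.

```python
import xml.sax


class EnglishDocuments(xml.sax.ContentHandler):
    """Collect <document> text from every <item> whose <lang> is 'en'."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._lang = None
        self._document = None
        self._current = None   # name of the element being captured, if any
        self._buf = []

    def startElement(self, name, attrs):
        if name == "item":
            self._lang = None
            self._document = None
        elif name in ("lang", "document"):
            self._current = name
            self._buf = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._current is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._current:
            text = "".join(self._buf)
            if name == "lang":
                self._lang = text.strip()
            else:
                self._document = text
            self._current = None
        elif name == "item":
            if self._lang == "en" and self._document is not None:
                self.documents.append(self._document)
```

For a file, `xml.sax.parse("file", handler)` streams the input through the handler without building a tree; for a large result set the handler could write each document out in endElement instead of keeping a list.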

Upvotes: 0
