Reputation: 537
I have a large (~50 MB) file containing poorly formatted XML that describes documents and their properties between <item> </item> tags, and I want to extract the text from all English documents.
Python's standard XML parsing utilities (dom, sax, expat) choke on the bad formatting, and more forgiving libraries (sgmllib, BeautifulSoup) parse the entire file and take too long.
<item>
<title>some title</title>
<author>john doe</author>
<lang>en</lang>
<document> .... </document>
</item>
Does anyone know a way to extract the text between <document> </document> only if lang=en, without parsing the entire document?
Additional information: Why it's "poorly formatted"
Some of the documents contain an element <dc:link></dc:link> whose dc prefix is never declared, which causes problems with the parsers. Python's xml.dom.minidom complains:
ExpatError: unbound prefix: line 13, column 0
Upvotes: 1
Views: 1327
Reputation: 342273
If you have gawk:
gawk 'BEGIN{
# treat everything up to each closing item tag as one record
RS="</item>"
startpat="<document>"
endpat="</document>"
lpat=length(startpat)
}
/<lang>en<\/lang>/{
# locate the document text inside this record
start=index($0,startpat)
end=index($0,endpat)
print substr($0,start+lpat,end-(start+lpat))
}' file
Output:
$ more file
Junk
Junk
<item>
<title>some title</title>
<author>john doe</author>
<lang>en</lang>
<document> text
i want blah ............ </document>
</item>
junk
junk
<item>
<title>some title</title>
<author>jane doe</author>
<lang>ch</lang>
<document> junk text
.. ............ </document>
</item>
junk
blahblah..
<item>
<title>some title</title>
<author>GI joe</author>
<lang>en</lang>
<document> text i want ..... in one line </document>
</item>
aksfh
aslkfj
dflkas
$ ./shell.sh
text
i want blah ............
text i want ..... in one line
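Since the question is about Python, the same record-splitting idea can be sketched there as well. This is only a rough equivalent of the gawk script, not a real parser: it assumes the tags appear literally as <lang>en</lang>, <document> and </document>, and the function name english_documents and the file name are made up for illustration.

```python
def english_documents(chunks):
    """Yield the text inside <document>...</document> for every record
    (text up to a closing </item>) that contains <lang>en</lang>.

    `chunks` is any iterable of strings, e.g. pieces read from a file,
    so the whole file never has to be held in memory at once.
    """
    start_tag, end_tag = "<document>", "</document>"
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            # Cut off one complete record at a time, like RS="</item>".
            record, sep, rest = buffer.partition("</item>")
            if not sep:
                break  # no complete record yet, read more input
            buffer = rest
            if "<lang>en</lang>" not in record:
                continue
            start = record.find(start_tag)
            if start == -1:
                continue
            end = record.find(end_tag, start)
            if end != -1:
                yield record[start + len(start_tag):end]


if __name__ == "__main__":
    # "file" is a placeholder name, as in the gawk example.
    with open("file", encoding="utf-8", errors="replace") as f:
        for text in english_documents(iter(lambda: f.read(65536), "")):
            print(text)
```

Like the gawk version, this never builds a tree, so undeclared prefixes such as dc: are simply ignored.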
Upvotes: 1
Reputation: 3377
If you are OK with Java, I think VTD-XML would handle this without any issues from those undefined prefixes...
Upvotes: 0
Reputation: 86744
Depending on how (and how badly) the document is 'broken', it might be possible to write a simple filter in Perl/Python that fixes it enough to pass XML well-formedness tests, so that it can be loaded into a DOM or run through XSLT.
Can you add some examples of what's wrong with the input?
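If the only problem turns out to be the unbound dc prefix reported in the question, one possible filter is to wrap each item in a root element that declares the prefix before handing it to a normal parser. A minimal sketch, assuming each item has already been cut out of the file as a string; the namespace URI below is a guess (the Dublin Core one), the name parse_item is hypothetical, and any URI would satisfy the parser.

```python
import xml.etree.ElementTree as ET

# Assumption: "dc" means Dublin Core. The parser only needs *some* URI
# bound to the prefix, so the exact value does not matter for extraction.
DC_NS = "http://purl.org/dc/elements/1.1/"


def parse_item(fragment):
    """Parse one <item>...</item> string that may use an undeclared dc: prefix."""
    # Declaring the prefix on a wrapper element makes the fragment well formed.
    wrapped = '<root xmlns:dc="%s">%s</root>' % (DC_NS, fragment)
    return ET.fromstring(wrapped).find("item")


item = parse_item(
    "<item><lang>en</lang><dc:link>x</dc:link>"
    "<document> some text </document></item>"
)
if item.findtext("lang") == "en":
    print(item.findtext("document"))
```

This only helps if the rest of each item is well formed; other kinds of breakage would need their own fixes.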
Upvotes: 0
Reputation: 57926
You'll need an event-oriented parser, like SAX or, in .NET, System.Xml.XmlReader.
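In Python, an event-oriented version could look roughly like the sketch below, using xml.sax from the standard library. As far as I know, xml.sax leaves namespace processing off by default, so an undeclared prefix like dc: should not by itself trigger the "unbound prefix" error; the input still has to be otherwise well formed, with a single root element. The class and function names are made up for illustration, and the handler assumes <document> holds plain text without nested tags.

```python
import xml.sax


class EnglishDocumentHandler(xml.sax.ContentHandler):
    """Collect the text of <document> for every <item> whose <lang> is 'en'."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._tag = None   # name of the element currently open, if any
        self._lang = []
        self._doc = []

    def startElement(self, name, attrs):
        self._tag = name
        if name == "item":
            # New item: forget whatever the previous one collected.
            self._lang, self._doc = [], []

    def characters(self, content):
        # Text can arrive in several pieces, so accumulate it.
        if self._tag == "lang":
            self._lang.append(content)
        elif self._tag == "document":
            self._doc.append(content)

    def endElement(self, name):
        self._tag = None
        if name == "item" and "".join(self._lang).strip() == "en":
            self.documents.append("".join(self._doc).strip())


def extract_english(xml_bytes):
    handler = EnglishDocumentHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.documents
```

For a file on disk, xml.sax.parse(path, handler) reads the input incrementally instead of loading it all at once.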
Upvotes: 0