Reputation: 141
The problem: given a 20 GB XML file with the following structure:
<root>
  <outer>
    <inner prop="x">...</inner>
    <inner prop="y">...</inner>
  </outer>
  <outer>
    <inner prop="z">...</inner>
    <inner prop="f">...</inner>
  </outer>
  ....
  ....
</root>
How can one evaluate the XPath count(//outer[inner/@prop="x" and inner/@prop="y"]) efficiently?
I have tried xmllint, pcregrep, xmlstarlet and xml_grep on Linux, and even awk and grep, but the system keeps running out of memory.
I was considering Python's SAX module, but I haven't found anything relevant, and I don't know how an XPath-like count could work with streaming. It would also be great if SAX could somehow ignore the inner text, as the file in question contains several unescaped characters that make the XML not well-formed...
Tough one
Upvotes: 0
Views: 60
Reputation: 3581
For large XML files you can use iterparse() and free memory with clear() as you go:
from lxml import etree

def count_matching_outer_elements(xml_file):
    count = 0
    # Stream the file, yielding each <outer> element once it has been fully parsed
    context = etree.iterparse(xml_file, events=('end',), tag='outer')
    for _, outer_elem in context:
        # True if this <outer> contains an <inner prop="x"> and an <inner prop="y">
        has_both = outer_elem.xpath('.//inner[@prop="x"] and .//inner[@prop="y"]')
        if has_both:
            count += 1
        # Free the element, then drop processed siblings still referenced by the root
        outer_elem.clear()
        while outer_elem.getprevious() is not None:
            del outer_elem.getparent()[0]
    return count

xml_file = '20GB.xml'
result = count_matching_outer_elements(xml_file)
print(f"Number of matched <outer> tags: {result}")
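Since the question mentions SAX, here is a minimal stdlib-only sketch of the same count using xml.sax. It never builds a tree and simply does not register a text handler, so inner text is skipped. The names OuterCounter and count_outer_sax are made up for illustration, and the sketch assumes that outer elements are not nested and that inner elements are direct children of outer. Note that a SAX parser will still stop on input that is not well-formed, so this alone does not solve the unescaped-character problem.

```python
import io
import xml.sax


class OuterCounter(xml.sax.ContentHandler):
    """Count <outer> elements that have <inner> children for every wanted prop value."""

    def __init__(self, wanted=("x", "y")):
        super().__init__()
        self.wanted = frozenset(wanted)
        self.seen = set()        # wanted prop values seen in the current <outer>
        self.depth = 0           # current element depth
        self.outer_depth = None  # depth of the currently open <outer>, if any
        self.count = 0

    def startElement(self, name, attrs):
        self.depth += 1
        if name == "outer" and self.outer_depth is None:
            self.outer_depth = self.depth
            self.seen = set()
        elif (name == "inner" and self.outer_depth is not None
              and self.depth == self.outer_depth + 1):
            # Only direct children of <outer> are considered (child axis)
            prop = attrs.get("prop")
            if prop in self.wanted:
                self.seen.add(prop)

    def endElement(self, name):
        if name == "outer" and self.depth == self.outer_depth:
            if self.seen == self.wanted:
                self.count += 1
            self.outer_depth = None
        self.depth -= 1


def count_outer_sax(source):
    """source may be a file name or a binary file-like object."""
    handler = OuterCounter()
    xml.sax.parse(source, handler)
    return handler.count


# Tiny in-memory sample standing in for the real file
sample = b"""<root>
<outer><inner prop="x">a</inner><inner prop="y">b</inner></outer>
<outer><inner prop="z">c</inner><inner prop="f">d</inner></outer>
<outer><inner prop="y">e</inner><inner prop="x">f</inner></outer>
</root>"""
print(count_outer_sax(io.BytesIO(sample)))
```

Memory use stays constant because only a small set of attribute values is kept per outer element; for the real file you would pass the file name instead of the BytesIO object.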
Upvotes: 3
Reputation: 163577
Saxon-EE (disclaimer, my company's product, license required) supports a streamed subset of XPath. From the command line:
java net.sf.saxon.Query -stream:on -s:test.xml -t \
  -qs:"count(/root/outer/copy-of(.)[inner/@prop='x' and inner/@prop='y'])"
The copy-of is needed to make the query streamable: it causes each outer element, as it is encountered, to be copied to a regular XDM tree that can then be processed using any XPath expression. The predicate (within the square brackets) wouldn't otherwise be streamable because it searches the subtree twice.
You could use //outer rather than /root/outer if you need to, but it makes streaming a lot more complex because it has to cater for the possibility of one outer element being nested within another.
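To illustrate why nesting complicates a streaming count, here is a rough Python sketch (stdlib xml.sax, not Saxon) of what a hand-written streaming counter for //outer has to do: it keeps one piece of state per open element on a stack, so a nested outer does not disturb the bookkeeping of its ancestor. The names NestedOuterCounter and count_outer_nested are hypothetical, and this says nothing about how Saxon implements streaming internally.

```python
import io
import xml.sax


class NestedOuterCounter(xml.sax.ContentHandler):
    """Count every <outer>, at any depth, whose child <inner> elements cover all wanted props."""

    def __init__(self, wanted=("x", "y")):
        super().__init__()
        self.wanted = frozenset(wanted)
        # One entry per open element: a set of seen props for <outer>, None otherwise
        self.stack = []
        self.count = 0

    def startElement(self, name, attrs):
        # The parent is the top of the stack before this element is pushed
        if name == "inner" and self.stack and self.stack[-1] is not None:
            prop = attrs.get("prop")
            if prop in self.wanted:
                self.stack[-1].add(prop)
        self.stack.append(set() if name == "outer" else None)

    def endElement(self, name):
        seen = self.stack.pop()
        if seen is not None and seen == self.wanted:
            self.count += 1


def count_outer_nested(source):
    """source may be a file name or a binary file-like object."""
    handler = NestedOuterCounter()
    xml.sax.parse(source, handler)
    return handler.count


# An <outer> nested inside another <outer>; the first two match, the last does not
nested_sample = b"""<root>
  <outer>
    <inner prop="x"/>
    <outer><inner prop="x"/><inner prop="y"/></outer>
    <inner prop="y"/>
  </outer>
  <outer><inner prop="x"/></outer>
</root>"""
print(count_outer_nested(io.BytesIO(nested_sample)))
```

The stack grows with element depth rather than file size, which is the extra state a streaming processor needs once outer elements may nest; with /root/outer a single set is enough.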
Upvotes: 1