unicorn

Reputation: 141

XPath on a 20 GB XML file

The problem: given a 20 GB XML file with the following structure:

<root>
 <outer>
   <inner prop="x">...</inner>
   <inner prop="y">...</inner>
 </outer>
 <outer>
   <inner prop="z">...</inner>
   <inner prop="f">...</inner>
 </outer>
....
....
</root>

How could one evaluate the XPath count(//outer[inner/@prop="x" and inner/@prop="y"]) efficiently?

I have tried xmllint, pcregrep, xmlstarlet and xml_grep on Linux, even awk and grep, but the system keeps running out of memory.

I was considering the Python sax module, but I haven't found anything relevant, and I also don't know how an XPath-like count could work with streaming. It would also be great if sax could somehow ignore inner text, as the file in question contains several unescaped characters which render the XML not well formed...

Tough one

Upvotes: 0

Views: 60

Answers (2)

Hermann12

Reputation: 3581

For large XML files you can use lxml's iterparse() and release each element with clear() once it has been processed:

from lxml import etree

def count_matching_outer_elements(xml_file):
    count = 0
    # Only 'end' events for <outer> are reported, so each element is complete when seen
    context = etree.iterparse(xml_file, events=('end',), tag='outer')
    for _, outer_elem in context:
        # Boolean XPath: True if this 'outer' has 'inner' children with prop="x" and prop="y"
        has_both = outer_elem.xpath('inner[@prop="x"] and inner[@prop="y"]')
        if has_both:
            count += 1
        # Free the element's content ...
        outer_elem.clear()
        # ... and drop the emptied siblings the root still references,
        # otherwise they pile up over a 20 GB file.
        # Assumes <outer> elements are not nested inside each other.
        while outer_elem.getprevious() is not None:
            del outer_elem.getparent()[0]
    return count

xml_file = '20GB.xml'
result = count_matching_outer_elements(xml_file)
print(f"Number of matched <outer> tags: {result}")
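
The question also asks whether the standard-library sax module could do this. A minimal sketch of how such a count might work with xml.sax follows; the names OuterCounter and count_outer are made up for illustration. It keeps one set of prop values per open outer element, so memory use does not grow with the file size, and element text is never collected. Note that expat, which backs xml.sax, still raises an error on input that is not well formed, so the unescaped characters would have to be repaired first.

```python
import io
import xml.sax


class OuterCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <outer> elements whose direct <inner>
    children carry every required prop value."""

    def __init__(self, required=("x", "y")):
        super().__init__()
        self.required = frozenset(required)
        self.count = 0
        self.tags = []  # names of the currently open elements
        self.seen = []  # one set of prop values per open <outer>

    def startElement(self, name, attrs):
        # Only <inner> elements that are direct children of an <outer> count,
        # mirroring inner/@prop in the XPath predicate.
        if name == "inner" and self.tags and self.tags[-1] == "outer":
            prop = attrs.get("prop")
            if prop is not None:
                self.seen[-1].add(prop)
        if name == "outer":
            self.seen.append(set())
        self.tags.append(name)

    def endElement(self, name):
        self.tags.pop()
        if name == "outer":
            # The <outer> is complete: check what its children provided.
            if self.required <= self.seen.pop():
                self.count += 1


def count_outer(source, required=("x", "y")):
    """source may be a file name or a file-like object."""
    handler = OuterCounter(required)
    xml.sax.parse(source, handler)
    return handler.count


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<root>'
        b'<outer><inner prop="x">a</inner><inner prop="y">b</inner></outer>'
        b'<outer><inner prop="z">c</inner><inner prop="f">d</inner></outer>'
        b'</root>'
    )
    print(count_outer(sample))
```

For the real file the call would be count_outer('20GB.xml'). The stack of sets is only needed if outer elements can be nested; for a flat /root/outer layout a single set would do.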

Upvotes: 3

Michael Kay

Reputation: 163577

Saxon-EE (disclaimer, my company's product, license required) supports a streamed subset of XPath. From the command line:

java net.sf.saxon.Query -stream:on -s:test.xml -t \
  -qs:"count(/root/outer/copy-of(.)[inner/@prop='x' and inner/@prop='y'])"

The copy-of is needed to make the query streamable - it causes each outer element, as it is encountered, to be copied to a regular XDM tree that can then be processed using any XPath expression. The predicate (within the square brackets) wouldn't otherwise be streamable because it's searching the subtree twice.

You could use //outer rather than /root/outer if you need to, but it makes streaming a lot more complex because it has to cater for the possibility of one outer element being nested within another.

Upvotes: 1
