Optimize XML parsing to use less memory

Question

I am currently parsing a bunch of large XML files using:

import xml.etree.ElementTree as ET

and this is my function:

def xmlfilter(input_file):
    print('Processing file: ' + input_file)
    tree = ET.ElementTree(file = input_file)
    root = tree.getroot()
    packagebody = root.find('./PackageBody')

    # reverse traversal to prevent skipping nodes after removal and avoid another loop
    for invvehicle in list(reversed(packagebody)):
        try:
            for port in invvehicle.findall('./PortfolioList/Portfolio'):
                for hold in port.find('./Holding'):
                    for rawitem in hold:
                        # if ID belongs to list above, proceed to next invvehicle
                        if rawitem.tag == 'ID' and rawitem.text in IDs:
                            print(rawitem.text + ' under ' + invvehicle.attrib['_Id'])
                            raise SkipNode

            # if no ID from list above is found, remove the node
            packagebody.remove(invvehicle)

        except SkipNode:
            continue

    ET.ElementTree(root).write(queuedir + "filtered_" + input_file)
    ET.ElementTree(root).write(ftpdir + "filtered_" + input_file)

When parsing a 10 GB file, it maxes out the machine's memory. Is there anything I can change in the above script to reduce memory usage? And, if possible, to make the processing faster as well?

I would appreciate a solution that uses only the above library. But if there is another library that manages memory much better, please do suggest it as well.

tdelaney · Accepted Answer

lxml is usually more space efficient than ElementTree because it is backed by the C-based libxml2. You should be able to write a single XPath expression to select the nodes you want, which removes the need to reverse the list before deleting. I wasn't sure how to express the "in IDs" test in XPath, so I wrote an XPath extension function that checks whether the XML text is in a set of values you supply. The class is initialized with your IDs, and its __call__ method is called by XPath.

I think your XPath would be something like

"PackageBody/*[not(is_in_ids(PortfolioList/Portfolio/Holding/ID/text()))]"

but I don't have your data to test against, so that's just a guess. I wrote a simpler example to demonstrate.
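One way to try that guess is against made-up data. The sketch below has not been run on the real files. It assumes the IDs sit one level below the children of Holding, as the nested loops in the question suggest (hence Holding/*/ID), and the element names Package, InvestmentVehicle and HoldingDetail are invented for the sample. It also uses a hypothetical any_in_ids function that looks at every matched ID rather than only the first.

```python
from lxml import etree

# Hypothetical sample shaped like the loops in the question. The element
# names Package, InvestmentVehicle and HoldingDetail are assumptions.
SAMPLE = b"""<Package>
  <PackageBody>
    <InvestmentVehicle _Id="A">
      <PortfolioList><Portfolio><Holding>
        <HoldingDetail><ID>zero</ID></HoldingDetail>
        <HoldingDetail><ID>one</ID></HoldingDetail>
      </Holding></Portfolio></PortfolioList>
    </InvestmentVehicle>
    <InvestmentVehicle _Id="B">
      <PortfolioList><Portfolio><Holding>
        <HoldingDetail><ID>nine</ID></HoldingDetail>
      </Holding></Portfolio></PortfolioList>
    </InvestmentVehicle>
  </PackageBody>
</Package>"""

wanted_ids = {"one", "two", "three"}


def any_in_ids(context, values):
    """XPath extension: true if any text node in `values` is a wanted ID."""
    return any(v.strip() in wanted_ids for v in values)


# Register the function without a namespace prefix, as in the answer.
ns = etree.FunctionNamespace(None)
ns["any_in_ids"] = any_in_ids

root = etree.fromstring(SAMPLE)
xpath = ("PackageBody/*[not(any_in_ids("
         "PortfolioList/Portfolio/Holding/*/ID/text()))]")

# Remove every child of PackageBody that has no wanted ID.
for vehicle in root.xpath(xpath):
    vehicle.getparent().remove(vehicle)

kept = [v.get("_Id") for v in root.find("PackageBody")]
print(kept)
```

With this sample only the vehicle with _Id A survives, because one of its IDs is in the set.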

test.xml


<root>
    <a>
        <b>one</b><b>five</b><b>two</b><b>seven</b>
    </a>
</root>

test.py

from lxml import etree
 
class IsInIDs:
    """lxml XPath extension function: checks whether the first value
    given is in the set of ids supplied at construction."""

    def __init__(self, list_of_ids):
        self.ids = set(list_of_ids)

    def __call__(self, context, value):
        # value is the list of text nodes matched by the XPath argument;
        # only the first one is examined
        return bool(value) and value[0].strip() in self.ids

my_ids = ["one", "two", "three"]

ns = etree.FunctionNamespace(None)
ns['is_in_ids'] = IsInIDs(my_ids)

doc = etree.parse("test.xml")
for item in doc.xpath("a/b[not(is_in_ids(text()))]"):
    item.getparent().remove(item)

print(etree.tostring(doc, encoding="unicode"))
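
If the tree is still too large to hold in memory, the standard library can also filter while it parses. Below is a minimal sketch with ElementTree.iterparse. It assumes PackageBody is a direct child of the root (as root.find('./PackageBody') in the question implies) and that the IDs sit at Holding/*/ID. The function name filter_while_parsing and the sample element names Package, InvestmentVehicle and HoldingDetail are made up for illustration. Each child of PackageBody is checked as soon as its closing tag is read and dropped immediately if it has no wanted ID, so memory grows with what is kept rather than with the whole file.

```python
import io
import xml.etree.ElementTree as ET


def filter_while_parsing(source, wanted_ids):
    """Parse `source`, dropping PackageBody children that have no wanted ID.

    Returns the root element with only the kept children attached.
    """
    root = None
    open_elems = []  # elements whose closing tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            open_elems.append(elem)
            continue
        open_elems.pop()  # elem has just been closed
        # A direct child of root/PackageBody leaves exactly two elements open.
        if len(open_elems) == 2 and open_elems[1].tag == "PackageBody":
            id_elems = elem.iterfind("PortfolioList/Portfolio/Holding/*/ID")
            if not any(e.text in wanted_ids for e in id_elems):
                open_elems[1].remove(elem)
    return root


# Hypothetical sample data; the real files are not available here.
SAMPLE = b"""<Package>
  <PackageBody>
    <InvestmentVehicle _Id="A">
      <PortfolioList><Portfolio><Holding>
        <HoldingDetail><ID>zero</ID></HoldingDetail>
        <HoldingDetail><ID>one</ID></HoldingDetail>
      </Holding></Portfolio></PortfolioList>
    </InvestmentVehicle>
    <InvestmentVehicle _Id="B">
      <PortfolioList><Portfolio><Holding>
        <HoldingDetail><ID>nine</ID></HoldingDetail>
      </Holding></Portfolio></PortfolioList>
    </InvestmentVehicle>
  </PackageBody>
</Package>"""

root = filter_while_parsing(io.BytesIO(SAMPLE), {"one", "two", "three"})
kept = [v.get("_Id") for v in root.find("PackageBody")]
print(kept)
```

The returned root can then be written out with ET.ElementTree(root).write(...) as in the question. Unlike the question's loop, this checks every Holding in a Portfolio, not only the first.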
