Reputation: 11194
I am currently parsing a bunch of large XML files using:
import xml.etree.ElementTree as ET
and this is my function:
def xmlfilter(input_file):
    print('Processing file: ' + input_file)
    tree = ET.ElementTree(file = input_file)
    root = tree.getroot()
    packagebody = root.find('./PackageBody')
    # reverse traversal to prevent skipping nodes after removal and avoid another loop
    for invvehicle in list(reversed(packagebody)):
        try:
            for port in invvehicle.findall('./PortfolioList/Portfolio'):
                for hold in port.find('./Holding'):
                    for rawitem in hold:
                        # if ID belongs to list above, proceed to next invvehicle
                        if rawitem.tag == 'ID' and rawitem.text in IDs:
                            print(rawitem.text + ' under ' + invvehicle.attrib['_Id'])
                            raise SkipNode
            # if no ID from list above is found, remove the node
            packagebody.remove(invvehicle)
        except SkipNode:
            continue
    ET.ElementTree(root).write(queuedir + "filtered_" + input_file)
    ET.ElementTree(root).write(ftpdir + "filtered_" + input_file)
When parsing about 10 GB worth of files, this maxes out the machine's memory. Is there anything I can change in the script above to reduce memory usage and, if possible, also make the processing faster?
I would prefer a solution that only uses the library above. But if another library manages memory much better, please suggest it as well.
Upvotes: 0
Views: 463
Reputation: 77337
lxml is usually more space efficient than ElementTree because it is backed by the C-based libxml2. You should be able to write a single XPath expression to select the nodes you want, which removes the need to walk the children in reverse while deleting. I wasn't sure how to express the "in IDs" check, so I wrote an XPath extension function that checks whether the XML text is in a set of values you supply. The class is initialized with your IDs, and its __call__ method is invoked by the XPath engine.
I think your XPath would be something like
"PackageBody/*[not(is_in_ids(PortfolioList/Portfolio/Holding/ID/text()))]"
but I don't have your data to test against, so that's just a guess. I wrote a simpler example to demonstrate.
test.xml
<root>
<a><b>one</b><b>five</b><b>two</b><b>seven</b></a>
</root>
test.py
from lxml import etree

class IsInIDs:
    """lxml XPath extension function checks if single value
    given is in the given list of ids"""
    def __init__(self, list_of_ids):
        self.ids = set(list_of_ids)
    def __call__(self, context, value):
        return len(value) and value[0].strip() in self.ids

my_ids = ["one", "two", "three"]
ns = etree.FunctionNamespace(None)
ns['is_in_ids'] = IsInIDs(my_ids)
doc = etree.parse("test.xml")
for item in doc.xpath("a/b[not(is_in_ids(text()))]"):
    item.getparent().remove(item)
print(etree.tostring(doc))
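If you need to stay with the standard library, a streaming pass with ET.iterparse may also help. The following is only a minimal sketch, not tested against your data. It assumes that PackageBody is a direct child of the root and that the IDs sit at PortfolioList/Portfolio/Holding/*/ID, which is what the loops in the question suggest. The names filter_streaming, Package, InvestmentVehicle and HoldingDetail are made up for the illustration. Each child of PackageBody is checked as soon as its closing tag is parsed and dropped right away if no ID matches, so only the kept children stay in memory.

```python
import io
import xml.etree.ElementTree as ET

# Assumed location of the IDs relative to each child of <PackageBody>.
ID_PATH = "./PortfolioList/Portfolio/Holding/*/ID"

def filter_streaming(source, wanted_ids):
    """Parse source incrementally and drop every child of <PackageBody>
    that contains no ID from wanted_ids. Returns the filtered root."""
    wanted = set(wanted_ids)   # set lookup is O(1) per ID
    root = None
    body = None                # the <PackageBody> element while it is open
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
            elif depth == 2 and elem.tag == "PackageBody":
                body = elem
            continue
        depth -= 1             # now the number of still-open ancestors
        if elem is body:
            body = None        # left <PackageBody>
        elif depth == 2 and body is not None:
            # elem is a complete direct child of <PackageBody>
            if not any(i.text in wanted for i in elem.iterfind(ID_PATH)):
                body.remove(elem)   # frees the subtree immediately
    return root

# Tiny made-up document; the element names are hypothetical.
sample = b"""<Package><PackageBody>
<InvestmentVehicle _Id="A"><PortfolioList><Portfolio><Holding>
<HoldingDetail><ID>one</ID></HoldingDetail>
</Holding></Portfolio></PortfolioList></InvestmentVehicle>
<InvestmentVehicle _Id="B"><PortfolioList><Portfolio><Holding>
<HoldingDetail><ID>nine</ID></HoldingDetail>
</Holding></Portfolio></PortfolioList></InvestmentVehicle>
</PackageBody></Package>"""

kept_root = filter_streaming(io.BytesIO(sample), ["one", "two", "three"])
print([v.get("_Id") for v in kept_root.find("PackageBody")])  # ['A']
# ET.ElementTree(kept_root).write(...) would then save the result.
```

Unlike the loop in the question, this checks every Holding under a Portfolio rather than only the first one that find() returns. Peak memory is roughly the kept children plus the one currently being parsed, so it helps most when the filter discards the bulk of the file. If most children are kept, you would need to write them out incrementally as well.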
Upvotes: 1