Reputation: 1645
This is more of a theoretical question, to better understand objects, garbage collection and performance in Python.
Let's say I have a ton of XML files and want to iterate over each one, get all the tags, store them in a dict, increase a counter for each tag, etc. When I do this, the first (say) 15k iterations process really quickly, but afterwards the script slows down significantly, while memory usage, CPU load etc. look fine. Why is that? Am I creating hidden objects on each iteration that are not cleaned up, and can I do something to improve it? I tried to use regex instead of ElementTree, but it wasn't worth the effort, since I only want to extract first-level tags and it would have made things more complex.
Unfortunately I cannot give a reproducible example without providing the XML files, but this is my code:
import os
import datetime
import xml.etree.ElementTree as ElementTree
start_time = datetime.datetime.now()
original_implemented_tags = os.path.abspath("/path/to/file")
required_tags = {}
optional_tags = {}
new_tags = {}
# read original tags
for _ in open(original_implemented_tags, "r"):
    if "@XmlElement(name =" in _:
        _xml_attr = _.split('"')[1]
        if "required = true" in _:
            required_tags[_xml_attr] = 1  # set to 1 so dict.get(_xml_attr) is truthy (0 would be falsy)
        else:
            optional_tags[_xml_attr] = 1
# read all XML files from nested folder containing XML dumps and other files
clinical_trial_root_dir = os.path.abspath("/path/to/dump/folder")
xml_files = []
for root, dirs, files in os.walk(clinical_trial_root_dir):
    xml_files.extend([os.path.join(root, _) for _ in files if os.path.splitext(_)[-1] == '.xml'])
# function for parsing a file and extract unique tags
def read_via_etree(file):
    _root = ElementTree.parse(file).getroot()
    _main_tags = list(set([_.tag for _ in _root.findall("./")]))  # some tags occur twice
    for _attr in _main_tags:
        # if the tag doesn't exist in the original document, increase its count in new_tags
        if _attr not in required_tags.keys() and _attr not in optional_tags.keys():
            if _attr not in new_tags.keys():
                new_tags[_attr] = 1
            else:
                new_tags[_attr] += 1
        # otherwise, increase the count in either required_tags or optional_tags
        if required_tags.get(_attr):
            required_tags[_attr] += 1
        if optional_tags.get(_attr):
            optional_tags[_attr] += 1
# actual parsing with a progress indicator
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        print(f"Analyzed {idx} files")
    read_via_etree(xml)
# undoing the initial 1
for k in required_tags.keys():
    required_tags[k] -= 1
for k in optional_tags.keys():
    optional_tags[k] -= 1
print(f"Done parsing {len(xml_files)} documents in {datetime.datetime.now() - start_time}")
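As a side note, the two-dict bookkeeping with the sentinel 1 and the final decrement pass could be collapsed by using collections.Counter. This is a hypothetical simplification of the same counting logic, not the original script:

```python
import collections
import xml.etree.ElementTree as ElementTree

# Counters default missing keys to 0, so no sentinel 1 / final decrement pass is needed.
required_tags = collections.Counter()  # pre-fill keys with 0 for the known required tags
optional_tags = collections.Counter()  # pre-fill keys with 0 for the known optional tags
new_tags = collections.Counter()

def read_via_etree(file):
    root = ElementTree.parse(file).getroot()
    # iterating an Element yields its direct children; the set comprehension
    # deduplicates first-level tags that occur twice
    for tag in {child.tag for child in root}:
        if tag in required_tags:
            required_tags[tag] += 1
        elif tag in optional_tags:
            optional_tags[tag] += 1
        else:
            new_tags[tag] += 1
```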
Example of one XML file:
<parent_element>
    <tag_i_need>
        <tag_i_dont_need>Some text i dont need</tag_i_dont_need>
    </tag_i_need>
    <another_tag_i_need>Some text i also dont need</another_tag_i_need>
</parent_element>
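For a document like the one above, the first-level scan boils down to the following standalone check (iterating the root element directly, which yields the same first-level children the script collects):

```python
import xml.etree.ElementTree as ElementTree

sample = """<parent_element>
    <tag_i_need>
        <tag_i_dont_need>Some text i dont need</tag_i_dont_need>
    </tag_i_need>
    <another_tag_i_need>Some text i also dont need</another_tag_i_need>
</parent_element>"""

root = ElementTree.fromstring(sample)
# iterating an Element yields only its direct children, i.e. the first-level tags
main_tags = sorted({child.tag for child in root})
print(main_tags)  # ['another_tag_i_need', 'tag_i_need']
```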
Upvotes: 0
Views: 161
Reputation: 1645
After the helpful comments, I added a timestamp to my loop, indicating how much time has passed since the last 1k documents, and flushed sys.stdout:
import sys
loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}")
        sys.stdout.flush()
        loop_timer = datetime.datetime.now()
    read_via_etree(xml)
I think it makes sense now, since the XML files vary in size and the standard output stream is buffered. Thanks to Albert Winestein.
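For what it's worth, on Python 3 the separate sys.stdout.flush() call can also be folded into print via its flush keyword. A minimal sketch of the same progress loop (xml_files and read_via_etree stand in for the objects from the script above):

```python
import datetime

xml_files = []  # placeholder; in the real script this list is built by os.walk

loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        # flush=True pushes the buffered line to the terminal immediately,
        # doing the same job as a separate sys.stdout.flush() call
        print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}", flush=True)
        loop_timer = datetime.datetime.now()
    # read_via_etree(xml) would go here
```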
Upvotes: 1