mnikley

Reputation: 1645

Improving speed while iterating over ~400k XML files

This is more of a theoretical question, aimed at better understanding objects, garbage collection, and performance in Python.

Let's say I have a ton of XML files and want to iterate over each one, get all the tags, store them in a dict, increase counters for each tag, etc. When I do this, the first ~15k iterations process really quickly, but afterwards the script slows down significantly, while memory usage, CPU load, etc. look fine. Why is that? Do I create hidden objects on each iteration that are not cleaned up, and can I do something to improve it? I tried to use regex instead of ElementTree, but it wasn't worth the effort since I only want to extract first-level tags, and it would make things more complex.

Unfortunately, I cannot give a reproducible example without providing the XML files; however, this is my code:

import os
import datetime
import xml.etree.ElementTree as ElementTree

start_time = datetime.datetime.now()

original_implemented_tags = os.path.abspath("/path/to/file")

required_tags = {}
optional_tags = {}
new_tags = {}

# read original tags (close the file properly via a context manager)
with open(original_implemented_tags, "r") as f:
    for line in f:
        if "@XmlElement(name =" in line:
            _xml_attr = line.split('"')[1]
            if "required = true" in line:
                required_tags[_xml_attr] = 1  # set to 1 so dict.get(_xml_attr) is truthy (0 would be False)
            else:
                optional_tags[_xml_attr] = 1

# read all XML files from nested folder containing XML dumps and other files
clinical_trial_root_dir = os.path.abspath("/path/to/dump/folder")
xml_files = []
for root, dirs, files in os.walk(clinical_trial_root_dir):
    xml_files.extend([os.path.join(root, _) for _ in files if os.path.splitext(_)[-1] == '.xml'])


# function for parsing a file and extracting its unique first-level tags
def read_via_etree(file):
    _root = ElementTree.parse(file).getroot()
    _main_tags = {_.tag for _ in _root.findall("./")}  # a set, since some tags occur twice
    for _attr in _main_tags:
        # if the tag doesn't exist in the original document, increase its count in new_tags
        if _attr not in required_tags and _attr not in optional_tags:
            if _attr not in new_tags:
                new_tags[_attr] = 1
            else:
                new_tags[_attr] += 1

        # otherwise, increase the count in either required_tags or optional_tags
        if required_tags.get(_attr):
            required_tags[_attr] += 1
        if optional_tags.get(_attr):
            optional_tags[_attr] += 1


# actual parsing with indicator
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        print(f"Analyzed {idx} files")
    read_via_etree(xml)

# undo the initial 1
for k in required_tags:
    required_tags[k] -= 1

for k in optional_tags:
    optional_tags[k] -= 1

print(f"Done parsing {len(xml_files)} documents in {datetime.datetime.now() - start_time}")

Example of one XML file:

<parent_element>
  <tag_i_need>
    <tag_i_dont_need>Some text i dont need</tag_i_dont_need>
  </tag_i_need>
  <another_tag_i_need>Some text i also dont need</another_tag_i_need>
</parent_element>
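
For reference, running the first-level extraction on that snippet (saved here as example.xml, a filename chosen for illustration) yields only the direct children of the root, not the nested tags:

import xml.etree.ElementTree as ElementTree

root = ElementTree.parse("example.xml").getroot()
print(sorted({child.tag for child in root.findall("./")}))
# prints ['another_tag_i_need', 'tag_i_need']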

Upvotes: 0

Views: 161

Answers (1)

mnikley

Reputation: 1645

After the helpful comments, I added a timestamp to my loop indicating how much time has passed since the last 1k documents, and flushed sys.stdout:

import sys

loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}")
        sys.stdout.flush()
        loop_timer = datetime.datetime.now()
    read_via_etree(xml)

I think it makes sense now, since the XML files vary in size and the standard output stream is buffered. Thanks to Albert Winestein
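
As a side note, the explicit sys.stdout.flush() can be folded into the print call itself, since print accepts a flush keyword argument (Python 3.3+):

loop_timer = datetime.datetime.now()
for idx, xml in enumerate(xml_files):
    if idx % 1000 == 0:
        # flush=True writes the buffered output immediately
        print(f"Analyzed {idx} files in {datetime.datetime.now() - loop_timer}", flush=True)
        loop_timer = datetime.datetime.now()
    read_via_etree(xml)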

Upvotes: 1
