user2401432

Reputation: 23

Python: How to process a large XML file with many child elements under one root

I have an XML file with a structure like this:

<report>
  <table>
    <detail name="John" surname="Smith">
    <detail name="Michael" surname="Smith">
    <detail name="Nick" surname="Smith">
    ... {a lot of <detail> elements}
  </table>
</report>

For each element, I need to check whether the 'name' attribute equals the 'surname' attribute.

The XML file is larger than 1 GB, and I get an error when I try etree.parse(file).

How can I process the elements one by one using Python and lxml?

Upvotes: 1

Views: 5282

Answers (3)

Parfait

Reputation: 107587

Consider iterparse, which lets you work on elements while the tree is being built. The code below checks whether the name attribute equals the surname attribute. Use the if block for further processing, such as conditionally appending values to a list:

import xml.etree.ElementTree as et

data = []
path = "/path/to/source.xml"

# get an iterable
context = et.iterparse(path, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
ev, root = next(context)

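for ev, el in context:
    # on 'end' the element is complete, so it can be discarded afterwards
    if ev == 'end' and el.tag == 'detail':
        print(el.attrib['name'] == el.attrib['surname'])
        data.append([el.attrib['name'], el.attrib['surname']])
        # release the processed element; root.clear() alone would not
        # free the <detail> children still held by <table>
        el.clear()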

print(data)
# False
# False
# False

# [['John', 'Smith'], ['Michael', 'Smith'], ['Nick', 'Smith']]
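For a file over 1 GB, collecting every pair in a list may itself use too much memory. As a minimal sketch (not part of the code above), a generator can yield each pair and then detach the element from its parent so that it can be garbage-collected. The function name iter_name_surname and the in-memory sample are hypothetical, and the sample assumes the detail elements are self-closed so that the XML is well-formed:

```python
import io
import xml.etree.ElementTree as et


def iter_name_surname(source):
    """Yield (name, surname) for every <detail>, releasing elements as we go."""
    parent = None
    for event, el in et.iterparse(source, events=("start", "end")):
        if event == "start" and el.tag == "table":
            parent = el  # remember the container of the <detail> elements
        elif event == "end" and el.tag == "detail":
            yield el.attrib["name"], el.attrib["surname"]
            if parent is not None:
                parent.remove(el)  # detach so the element can be freed


# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample = io.BytesIO(
    b'<report><table>'
    b'<detail name="John" surname="Smith"/>'
    b'<detail name="Smith" surname="Smith"/>'
    b'</table></report>'
)

# Keep only the elements whose name equals their surname.
matches = [pair for pair in iter_name_surname(sample) if pair[0] == pair[1]]
print(matches)  # [('Smith', 'Smith')]
```

With a real file, pass the path instead of the BytesIO object; only the matching pairs are kept in memory.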

Upvotes: 3

newtover

Reputation: 32094

There are basically three standard approaches to parsing XML:

  • building an in-memory Document Object Model (DOM): you load the whole document into memory and can walk the tree arbitrarily
  • writing a push-style SAX parser: processing the document becomes a sequence of events (an opening tag, text, a closing tag, a comment, a processing instruction, etc.), some of which you can subscribe to. You register your callbacks and run the parser. The document is read to the end, but the parser doesn't build an internal representation of the whole document.
  • writing a pull-style StAX parser: the parser streams the events, you process them sequentially, and you can stop at any time (useful for reading XML metadata at the beginning of a document and then stopping)

lxml is a binding to the libxml2 C library, which implements DOM; its iterparse method appears to implement the StAX approach. A SAX parser is built into Python itself: https://docs.python.org/3.6/library/xml.sax.html
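To illustrate the pull style with the standard library only, here is a rough sketch using xml.dom.pulldom (a different module from the lxml iterparse mentioned above). The sample string is hypothetical and assumes self-closed detail elements. The loop stops at the first element whose name equals its surname:

```python
from xml.dom import pulldom

# Hypothetical sample data standing in for the real file.
xml_text = (
    '<report><table>'
    '<detail name="John" surname="Smith"/>'
    '<detail name="Smith" surname="Smith"/>'
    '<detail name="Nick" surname="Smith"/>'
    '</table></report>'
)

first_match = None
for event, node in pulldom.parseString(xml_text):
    if event == pulldom.START_ELEMENT and node.tagName == "detail":
        name = node.getAttribute("name")
        if name == node.getAttribute("surname"):
            first_match = (name, node.getAttribute("surname"))
            break  # the caller drives the parser, so it can stop here

print(first_match)  # ('Smith', 'Smith')
```

For a file on disk, pulldom.parse(path) would take the place of pulldom.parseString.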

For your case the standard approach is to use a SAX parser.
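A minimal sketch of that approach with xml.sax might look like the following. The handler class name DetailHandler and the sample data are assumptions for illustration, and the sample uses self-closed detail elements so that it is well-formed:

```python
import io
import xml.sax


class DetailHandler(xml.sax.ContentHandler):
    """Count <detail> elements and record those whose name equals surname."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.matches = []

    def startElement(self, tag, attrs):
        # Called by the parser for every opening tag; nothing is kept
        # in memory except what we store ourselves.
        if tag != "detail":
            return
        self.total += 1
        name = attrs.get("name")
        if name is not None and name == attrs.get("surname"):
            self.matches.append(name)


# Hypothetical in-memory sample; pass a file path for real data.
sample = io.BytesIO(
    b'<report><table>'
    b'<detail name="John" surname="Smith"/>'
    b'<detail name="Smith" surname="Smith"/>'
    b'<detail name="Nick" surname="Smith"/>'
    b'</table></report>'
)

handler = DetailHandler()
xml.sax.parse(sample, handler)
print(handler.total, handler.matches)  # 3 ['Smith']
```

Because the parser never builds a tree, memory use should stay roughly constant regardless of how many detail elements the file contains.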

Upvotes: 2

Bill Bell

Reputation: 21643

You could use the iterparse method, which is meant for handling large XML files. However, your file has an especially simple structure, so using iterparse would be unnecessarily complicated.

I will provide two answers in one script. I answer your question directly by showing how to parse individual lines of the XML with lxml, and I also show what I think is likely to be a better answer: a regex.

The code reads each line of the XML and ignores lines that do not begin with '<detail'. When the script finds such a line, it passes it to etree from lxml for parsing and then displays the attributes from that line. Afterwards it uses a regex to extract the same attributes and display them.

I strongly suspect that the regex would be faster.

>>> from lxml import etree
>>> report = '''\
... <report>
...     <table>
...         <detail name="John" surname="Smith">
...         <detail name="Michael" surname="Smith">
...         <detail name="Nick" surname="Smith">
...     </table>
... </report>'''
>>> import re
>>> line = '<detail name="John" surname="Smith">'
>>> re.search(r'name="([^"]*)"\s+surname="([^"]*)', line).groups()
('John', 'Smith')
>>> for line in report.split('\n'):
...     if line.strip().startswith('<detail'):
...         tree = etree.fromstring(line.replace('>', '/>'))
...         tree.attrib['name'], tree.attrib['surname']
...         re.search(r'name="([^"]*)"\s+surname="([^"]*)', line).groups()
...         
('John', 'Smith')
('John', 'Smith')
('Michael', 'Smith')
('Michael', 'Smith')
('Nick', 'Smith')
('Nick', 'Smith')
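To apply the regex idea to the actual question without loading the whole file, the lines can be streamed through a generator. This is only a sketch under the same assumptions as above: each detail element sits on its own line, name always precedes surname, and attribute values contain no escaped quotes. The names DETAIL_RE and matching_details are made up for illustration:

```python
import re

# Assumes name comes before surname and both are double-quoted.
DETAIL_RE = re.compile(r'<detail\s+name="([^"]*)"\s+surname="([^"]*)"')


def matching_details(lines):
    """Yield (name, surname) for lines where the two attributes are equal."""
    for line in lines:
        m = DETAIL_RE.search(line)
        if m and m.group(1) == m.group(2):
            yield m.groups()


# Hypothetical sample lines; an open file object works the same way.
sample_lines = [
    '<report>',
    '  <table>',
    '    <detail name="John" surname="Smith">',
    '    <detail name="Smith" surname="Smith">',
    '  </table>',
    '</report>',
]

found = list(matching_details(sample_lines))
print(found)  # [('Smith', 'Smith')]
```

With a real file, something like `with open(path, encoding="utf-8") as fh:` followed by a loop over `matching_details(fh)` reads one line at a time, so memory use does not depend on the file size.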

Upvotes: 0
