Reputation: 23
I have an XML file with a data structure like this:
<report>
<table>
<detail name="John" surname="Smith">
<detail name="Michael" surname="Smith">
<detail name="Nick" surname="Smith">
... {a lot of <detail> elements}
</table>
</report>
I need to check, for each element, whether the attribute 'name' equals the attribute 'surname'.
The XML file is larger than 1 GB, and etree.parse(file) fails with an error.
How can I process the elements one by one using Python and lxml?
Upvotes: 1
Views: 5282
Reputation: 107587
Consider iterparse, which lets you work on elements while the tree is being built. The code below checks whether the name attribute equals the surname attribute. Use the if block for further processing, such as conditionally appending values to a list:
import xml.etree.ElementTree as et

data = []
path = "/path/to/source.xml"

# get an iterable
context = et.iterparse(path, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
ev, root = next(context)

for ev, el in context:
    if ev == 'start' and el.tag == 'detail':
        print(el.attrib['name'] == el.attrib['surname'])
        data.append([el.attrib['name'], el.attrib['surname']])
        root.clear()

print(data)
# False
# False
# False
# [['John', 'Smith'], ['Michael', 'Smith'], ['Nick', 'Smith']]
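One caveat with very large files: root.clear() only detaches the direct children of the root, so when the detail elements sit under table, finished elements may stay reachable from their parent. A minimal sketch of one way around this follows, using only the standard library. It assumes the real file is well-formed (each detail is self-closed or has a closing tag, unlike the abbreviated sample in the question); the helper name iter_detail_attrs and the in-memory sample are illustrative, not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET


def iter_detail_attrs(source, tag="detail"):
    """Yield a dict of attributes for every <tag> element, then discard the element."""
    open_elements = []  # elements whose end tag has not been seen yet
    for event, el in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(el)
            continue
        open_elements.pop()  # el itself has just been closed
        if el.tag == tag:
            yield dict(el.attrib)
            if open_elements:
                open_elements[-1].remove(el)  # detach it from its parent


# Hypothetical in-memory sample; a file path or open file object works the same way.
sample = io.BytesIO(
    b'<report><table>'
    b'<detail name="John" surname="Smith"/>'
    b'<detail name="Smith" surname="Smith"/>'
    b'</table></report>'
)
same = [a for a in iter_detail_attrs(sample) if a.get("name") == a.get("surname")]
print(same)
# [{'name': 'Smith', 'surname': 'Smith'}]
```

Because each detail element is removed from its parent once it has been handled, the partially built tree should stay small regardless of how many elements the file contains.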
Upvotes: 3
Reputation: 32094
There are basically three standard approaches to parsing XML: DOM, SAX, and StAX.
lxml is a binding to the libxml2 C library, which implements DOM; its iterparse method appears to implement the StAX (pull-parsing) approach. A SAX parser is built into Python itself: https://docs.python.org/3.6/library/xml.sax.html
For your case the standard approach is to use a SAX parser.
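As a rough illustration of the SAX approach, here is a minimal sketch using the standard library's xml.sax module. The handler class name DetailHandler and the in-memory sample are assumptions made for the example, and the input is assumed to be well-formed XML.

```python
import io
import xml.sax


class DetailHandler(xml.sax.ContentHandler):
    """Collect the name of every <detail> whose name equals its surname."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is ever built in memory.
        if name == "detail":
            first = attrs.get("name")
            if first is not None and first == attrs.get("surname"):
                self.matches.append(first)


# Hypothetical in-memory sample; xml.sax.parse also accepts a file name.
sample = io.BytesIO(
    b'<report><table>'
    b'<detail name="John" surname="Smith"/>'
    b'<detail name="Smith" surname="Smith"/>'
    b'</table></report>'
)
handler = DetailHandler()
xml.sax.parse(sample, handler)
print(handler.matches)
# ['Smith']
```

Since the parser only fires callbacks and keeps no document tree, memory use should not depend on the size of the file.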
Upvotes: 2
Reputation: 21643
You could use the iterparse method, which is meant for handling large XML files. However, your file has an especially simple structure, so using iterparse would be unnecessarily complicated.
I will provide two answers in one script. I answer your question directly by showing how to parse lines in the XML using lxml, and I provide what I think is likely to be a better answer using a regex.
The code reads each line in the XML and ignores those lines that do not begin with '<detail'. When the script finds such a line, it passes it to etree from lxml for parsing and then displays the attributes from the line. Afterwards it uses a regex to parse out the same attributes and display them.
I strongly suspect that the regex would be faster.
>>> from lxml import etree
>>> report = '''\
... <report>
... <table>
... <detail name="John" surname="Smith">
... <detail name="Michael" surname="Smith">
... <detail name="Nick" surname="Smith">
... </table>
... </report>'''
>>> import re
>>> line = '<detail name="John" surname="Smith">'
>>> re.search(r'name="([^"]*)"\s+surname="([^"]*)', line).groups()
('John', 'Smith')
>>> for line in report.split('\n'):
...     if line.strip().startswith('<detail'):
...         tree = etree.fromstring(line.replace('>', '/>'))
...         tree.attrib['name'], tree.attrib['surname']
...         re.search(r'name="([^"]*)"\s+surname="([^"]*)', line).groups()
...
('John', 'Smith')
('John', 'Smith')
('Michael', 'Smith')
('Michael', 'Smith')
('Nick', 'Smith')
('Nick', 'Smith')
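For a file that is too large to hold in a string, the same regex idea can be applied while streaming the file line by line. The following is only a sketch: it assumes one detail element per line with name written before surname, and the names DETAIL_RE and names_equal_to_surname are made up for the example.

```python
import io
import re

# Assumes the attribute order name, surname, as in the question's sample.
DETAIL_RE = re.compile(r'<detail\s+name="([^"]*)"\s+surname="([^"]*)"')


def names_equal_to_surname(lines):
    """Yield the name from each <detail> line where name equals surname."""
    for line in lines:
        match = DETAIL_RE.search(line)
        if match and match.group(1) == match.group(2):
            yield match.group(1)


# Hypothetical sample; an open text file can be passed in the same way.
sample = io.StringIO(
    '<report>\n<table>\n'
    '<detail name="John" surname="Smith">\n'
    '<detail name="Smith" surname="Smith">\n'
    '</table>\n</report>\n'
)
result = list(names_equal_to_surname(sample))
print(result)
# ['Smith']
```

Iterating over a file object reads one line at a time, so only the current line needs to be in memory.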
Upvotes: 0