Reputation: 1273
I am parsing an XML payload using ElementTree. I cannot share the exact code or file as it contains sensitive information. I am able to successfully extract the information I need by iterating through an element (as seen in the ElementTree documentation) and appending the output to lists. For example:
list_col_name = []
list_col_value = []
for col in root.iter('my_table'):
    # get col name
    col_name = col.find('col_name').text
    list_col_name.append(col_name)
    # get col value
    col_value = col.find('col_value').text
    list_col_value.append(col_value)
I can now put these into a dictionary and proceed with the remainder of what needs to be done:
dict_ = dict(zip(list_col_name, list_col_value))
However, I need this to happen as quickly as possible and am wondering if there is a way I can extract list_col_name all at once (i.e., using findall() or something like that). Just curious about ways to increase the speed of XML parsing, if possible. All answers/recommendations are appreciated. Thank you in advance.
Upvotes: 0
Views: 1312
Reputation: 2469
I don't know if this is what you want.
from simplified_scrapy import SimplifiedDoc
html = '''
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
'''
doc = SimplifiedDoc(html)
# text of each country's rank element
ranks = doc.selects('country>(rank>text())')
print(ranks)
# the rank elements themselves
ranks = doc.selects('country>rank()')
print(ranks)
# all children of each country element
ranks = doc.selects('country>children()')
print(ranks)
Result:
['1', '4', '68']
[{'tag': 'rank', 'html': '1'}, {'tag': 'rank', 'html': '4'}, {'tag': 'rank', 'html': '68'}]
[[{'tag': 'rank', 'html': '1'}, {'tag': 'year', 'html': '2008'}, {'tag': 'gdppc', 'html': '141100'}, {'name': 'Austria', 'direction': 'E', 'tag': 'neighbor'}, {'name': 'Switzerland', 'direction': 'W', 'tag': 'neighbor'}], [{'tag': 'rank', 'html': '4'}, {'tag': 'year', 'html': '2011'}, {'tag': 'gdppc', 'html': '59900'}, {'name': 'Malaysia', 'direction': 'N', 'tag': 'neighbor'}], [{'tag': 'rank', 'html': '68'}, {'tag': 'year', 'html': '2011'}, {'tag': 'gdppc', 'html': '13600'}, {'name': 'Costa Rica', 'direction': 'W', 'tag': 'neighbor'}, {'name': 'Colombia', 'direction': 'E', 'tag': 'neighbor'}]]
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Upvotes: 1
Reputation: 107587
Consider a list comprehension with findall() to avoid the list initialization/append and explicit for loop, which may marginally improve performance:
# FINDALL LIST COMPREHENSION
list_col_name = [e.text for e in root.findall('./my_table/col_name')]
list_col_value = [e.text for e in root.findall('./my_table/col_value')]
dict(zip(list_col_name, list_col_value))
Alternatively, with lxml (a third-party library that fully supports XPath 1.0), consider xpath(), which can assign parsing output directly to lists, also avoiding the initialization/append and for loop:
import lxml.etree as et
...
# XPATH LISTS
list_col_name = root.xpath('my_table/col_name/text()')
list_col_value = root.xpath('my_table/col_value/text()')
dict(zip(list_col_name, list_col_value))
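For context, a minimal sketch of how root might be obtained in the elided setup above (the filename data.xml is only a placeholder for your actual source):

import lxml.etree as et

# hypothetical filename standing in for your actual XML source
root = et.parse('data.xml').getroot()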
Upvotes: 2
Reputation: 30971
My proposal is to use "incremental" parsing of the source file, based on the iterparse method. The reason is that you actually need only a couple of tags from each my_table element, so there is no need to build and keep the whole tree in memory.
Another hint is to use the lxml library instead of ElementTree. Although the iterparse method exists in both libraries, the lxml version has an additional tag parameter, so you are able to "limit" the loop to processing only the tags of interest.
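For comparison, a minimal sketch of the same incremental approach with the standard-library ElementTree, which has no tag parameter and therefore has to filter the tags by hand (assuming the same Tables.xml sample used below):

import xml.etree.ElementTree as ET

def fn_stdlib():
    key = ''
    dict_ = {}
    # stdlib iterparse has no tag= filter, so every end event is visited
    for event, elem in ET.iterparse('Tables.xml'):
        if elem.tag == 'col_name':
            key = elem.text
        elif elem.tag == 'col_value':
            dict_[key] = elem.text
        elif elem.tag == 'my_table':
            elem.clear()  # free the processed element (no getparent() in the stdlib)
    return dict_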
As the source file, I used something like:
<root>
  <my_table id="t1">
    <col_name>N1</col_name>
    <col_value>V1</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
  <my_table id="t2">
    <col_name>N2</col_name>
    <col_value>V2</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
  <my_table id="t3">
    <col_name>N3</col_name>
    <col_value>V3</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
</root>
Actually, in my source file the my_table element is repeated more times (not just 3), and some_other_stuff is repeated 8 times in each my_table, to simulate other elements contained in each my_table.
I performed 3 tests, using %timeit:
Your loop, preceded by parsing of the source XML file:
from lxml import etree as et

def fn1():
    root = et.parse('Tables.xml')
    list_col_name = []
    list_col_value = []
    for col in root.iter('my_table'):
        col_name = col.find('col_name').text
        list_col_name.append(col_name)
        col_value = col.find('col_value').text
        list_col_value.append(col_value)
    return dict(zip(list_col_name, list_col_value))
The execution time was 1.74 ms.
My loop, based on iterparse, processing only the "required" elements:
def fn2():
    key = ''
    dict_ = {}
    # lxml-only: tag= limits the events to the tags of interest
    context = et.iterparse('Tables.xml', tag=['my_table', 'col_name', 'col_value'])
    for action, elem in context:
        tag = elem.tag
        txt = elem.text
        if tag == 'col_name':
            key = txt                # remember the key ...
        elif tag == 'col_value':
            dict_[key] = txt         # ... and pair it with the value
        elif tag == 'my_table':
            elem.clear()             # drop the processed subtree
            elem.getparent().remove(elem)
    return dict_
I assume that in each my_table element col_name occurs before col_value, and that each my_table contains only one col_name child and one col_value child.
Note also that the above function clears each my_table element and removes it from the parsed XML tree (the getparent function is available only in the lxml version).
Another improvement is that I "directly" add each key / value pair to the dictionary to be returned by this function, so no zip is needed.
The execution time is 1.33 ms. Not very much quicker, but at least some time gain is visible.
You can also read all col_name and col_value elements by calling findall and then call zip:
def fn3():
    root = et.parse('Tables.xml')
    list_col_name = []
    for elem in root.findall('.//col_name'):
        list_col_name.append(elem.text)
    list_col_value = []
    for elem in root.findall('.//col_value'):
        list_col_value.append(elem.text)
    return dict(zip(list_col_name, list_col_value))
The execution time is 1.38 ms. Also somewhat quicker than your original solution, but with no significant difference from my first solution (fn2).
Of course, the final result heavily depends on the actual size and structure of your source XML file, so it is worth timing each variant on your own data.
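If you want to repeat the comparison outside IPython, where the %timeit magic is not available, a minimal harness with the standard timeit module might look like this (just a sketch; fn1, fn2 and fn3 are the functions above and Tables.xml is your own file):

import timeit

# rough timing sketch: average milliseconds per call for each variant
for fn in (fn1, fn2, fn3):
    elapsed = timeit.timeit(fn, number=100)
    print(f'{fn.__name__}: {elapsed / 100 * 1000:.2f} ms per call')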
Upvotes: 2