Reputation: 51
I want to extract some data from a large number of XML files that use namespaces. The problem is that the namespace URIs can differ from file to file, and each file declares several of them. Sample.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="http://some.domain/path1/styl.xsl"?>
<a:root xmlns:a="http://some.domain/path1/" xmlns:b="http://some.domain/path2/" xmlns:c="http://some.domain/path3/" xmlns:d="http://some.domain/path4/" xmlns:e="http://some.domain/path5/">
<a:task>Get data from element your_data_sir in xml files</a:task>
<a:files_to_process>More than 2k</a:files_to_process>
<a:how>Using Python</a:how>
<a:obstacle>
<b:name>Namespaces</b:name>
<c:description>Each xml file contains the same xmlns:prefixes but the URI of each prefix may differ!</c:description>
</a:obstacle>
<a:look_here>
<d:your_data_sir>Glass of Whisky</d:your_data_sir>
<d:your_data_sir>Cigar</d:your_data_sir>
<d:your_data_sir>Python problem to solve</d:your_data_sir>
</a:look_here>
<e:other_things_to_know>
<c:thing>Element look_here is always a child of root element.</c:thing>
<c:thing>look_here and your_data_sir preserve their prefixes in all xml files but URI can be different.</c:thing>
<c:thing>Some xml files have different elements before and after look_here element.</c:thing>
<c:thing>Number of siblings of look_here, before and after, may differ.</c:thing>
</e:other_things_to_know>
</a:root>
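I can successfully get data from the <d:your_data_sir> elements with this script:
import xml.etree.ElementTree as ET
# Prefix-to-URI map, hard-coded from Sample.xml
ns = {
    'a': 'http://some.domain/path1/',
    'b': 'http://some.domain/path2/',
    'c': 'http://some.domain/path3/',
    'd': 'http://some.domain/path4/',
}
# The file name has to be passed as a string
dom = ET.parse('Sample.xml').getroot()
# look_here is always a direct child of the root element
test = dom.find('a:look_here', ns)
# Print the text of every your_data_sir child
for x in test:
    print(x.text)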
I'm building a script that will run the code above on every XML file in a folder and its subfolders. The problem is that in some XML files the URI bound to xmlns:a (or to the other prefixes) is different. In that case the hard-coded dictionary no longer matches and my script can't find <a:look_here>, so I never reach the <d:your_data_sir> elements. I can't find a method that reads all the prefixes from the file being processed and builds the namespace dictionary from them. Or maybe there is a different way to solve my problem.
Please help. I'm new to Python, so please explain your solution if you can.
Upvotes: 2
Views: 132
Reputation: 51
Thanks to @mzjn I've managed to solve my problem. It seems very simple now. lxml elements have an nsmap property that returns a dictionary of the namespace prefixes in scope for that element. lxml is not part of the Python standard library, so you have to install it first. Now my script looks like this:
from lxml import etree
dom = etree.parse('/path/to/Sample.xml').getroot()
ns = dom.nsmap
test = dom.find('a:look_here', ns)
for x in test:
    print(x.text)
In my case dom.nsmap returns
{'a': 'http://some.domain/path1/', 'b': 'http://some.domain/path2/', 'c': 'http://some.domain/path3/', 'd': 'http://some.domain/path4/', 'e': 'http://some.domain/path5/'}
which is exactly what I had been fighting for over the past two days. Because the dictionary is built from each file's own xmlns declarations, I can now feed the script thousands of files without a changed URI causing it to miss data.
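If installing lxml is not an option, a similar dictionary can probably be built with the standard library alone. The following is only a sketch, not the method used above: it relies on the "start-ns" event of xml.etree.ElementTree.iterparse, and the function names collect_namespaces, extract_items and extract_from_tree are made up for illustration. It assumes, as in Sample.xml, that every prefix is declared once per file and that look_here is a child of the root.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_namespaces(xml_path):
    """Return {prefix: uri} for the xmlns declarations found in one file."""
    ns = {}
    # "start-ns" events yield a (prefix, uri) tuple for each declaration.
    for _event, (prefix, uri) in ET.iterparse(xml_path, events=("start-ns",)):
        # Skip the default namespace (empty prefix) and keep the first URI
        # seen for each prefix.
        if prefix and prefix not in ns:
            ns[prefix] = uri
    return ns


def extract_items(xml_path):
    """Return the text of each child of <a:look_here> under the root."""
    ns = collect_namespaces(xml_path)
    if "a" not in ns:
        # find() would raise SyntaxError for an unknown prefix.
        return []
    root = ET.parse(xml_path).getroot()
    look_here = root.find("a:look_here", ns)
    if look_here is None:
        return []
    return [child.text for child in look_here]


def extract_from_tree(folder):
    """Map every .xml file under folder (recursively) to its extracted items."""
    return {path: extract_items(path) for path in sorted(Path(folder).rglob("*.xml"))}
```

Calling extract_from_tree('/path/to/folder') would then return one list of strings per file. Each file is read twice here (once for the namespaces, once for the tree), which keeps the sketch simple at the cost of some speed.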
Upvotes: 3