Reputation: 135
I am new to this. My original XML file is about 8 GB, so it is hard to explore manually all the parents, grandparents, great-grandparents, etc. of the child I am interested in. I am trying to walk through all the nodes until that child is found. So I want to create a "skeleton" structure of the XML down to the child of interest, using country_data.xml from https://docs.python.org/2/library/xml.etree.elementtree.html. Apologies for the code:
def LookThrougStructure(parent, xpath_str, stop_flag):
    out_str.write('Parent tag: %s\n' % (parent.tag))
    for child in parent:
        if child.tag == my_tag:
            out_str.write('Child tag: %s\n' % (child.tag))
            #my_node_is_found_flag = 1
            break
        LookThrougStructure(child, child.tag, 0)
    return
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
my_tag = 'neighbor'
out_str = open('xml_structure.txt', 'w')
LookThrougStructure(root, root.tag, my_tag)
out_str.close()
It does not work as intended and yields all the node tags:
Parent tag: data
Parent tag: country
Parent tag: rank
Parent tag: year
Parent tag: gdppc
Child tag: neighbor
Parent tag: country
Parent tag: rank
Parent tag: year
Parent tag: gdppc
Child tag: neighbor
Parent tag: country
Parent tag: rank
Parent tag: year
Parent tag: gdppc
Child tag: neighbor
But I want only the chain of tags leading to the child I am interested in ("neighbor"), for example as a path: /data/country/neighbor. What is wrong?
Upvotes: 1
Views: 96
Reputation: 135
@Padraic, thanks a lot! Your code is mostly what I want. But if I insert an additional node (for example, attributes) that is a child of the country node and the parent of the neighbor nodes, it gives unexpected results:
<data>
    <country name="Liechtenstein">
        <attributes>
            <rank>1</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
        </attributes>
    </country>
    <country name="Singapore">
        <attributes>
            <rank>4</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
        </attributes>
    </country>
    <country name="Panama">
        <attributes>
            <rank>68</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
        </attributes>
    </country>
</data>
Anyway, your help was very fruitful. I took your code and wrote this:
import lxml.etree as et
root = et.parse('country_data.xml')
out_f = open('getpath.txt', 'w')
my_str1 = 'country[1]'
my_str2 = 'neighbor[1]'
for e in root.iter():
    s = root.getelementpath(e)
    if my_str1 not in s:
        continue
    if my_str2 not in s:
        continue
    out_f.write('%s\n' %(s))
    break
out_f.close()
The idea is simple: if the element path contains both the string 'country[1]' and the string 'neighbor[1]', it is written to the output file. For the original XML example it gives: country[1]/neighbor[1]. And for the XML with the additional parent it gives: country[1]/attributes/neighbor[1].
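The substring test above also accepts a path in which 'neighbor[1]' is not the last step. A minimal sketch of a stricter check that compares whole path segments follows. The helper name path_matches is hypothetical, and the path strings are hard-coded for illustration instead of being produced by lxml.

```python
def path_matches(path, first, last):
    """Return True if the path starts with `first` and ends with `last`."""
    segments = path.split('/')
    return segments[0] == first and segments[-1] == last


# Hypothetical element paths, written by hand for this sketch.
paths = [
    'country[1]',
    'country[1]/attributes',
    'country[1]/attributes/neighbor[1]',
    'country[1]/attributes/neighbor[2]',
]

# Pick the first path whose first and last segments match exactly.
match = next(p for p in paths if path_matches(p, 'country[1]', 'neighbor[1]'))
print(match)
```

Comparing segments instead of substrings keeps the filter independent of how many intermediate parents sit between the two tags.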
Upvotes: 1
Reputation: 180391
If I understand you correctly you want something like:
def look_through_structure(parent, my_tag):
    # Visit every element in document order, starting with parent itself.
    for node in parent.iter("*"):
        out_str.write('Parent tag: %s\n' % node.tag)
        for nxt in node:
            if nxt.tag == my_tag:
                out_str.write('child tag: %s\n' % my_tag)
                return
            out_str.write('Parent tag: %s\n' % nxt.tag)
            # Iterate nxt directly; getchildren() was removed in Python 3.9.
            if any(ch.tag == my_tag for ch in nxt):
                out_str.write('child tag: %s\n' % my_tag)
                return
If we change the function a bit and yield the tags:
def look_through_structure(parent, my_tag):
    for node in parent.iter("*"):
        yield node.tag
        for nxt in node:
            if nxt.tag == my_tag:
                yield nxt.tag
                return
            yield nxt.tag
            # Iterate nxt directly; getchildren() was removed in Python 3.9.
            if any(ch.tag == my_tag for ch in nxt):
                yield my_tag
                return
And run it on the file:
In [24]: root = tree.getroot()
In [25]: my_tag = 'neighbor'
In [26]: list(look_through_structure(root, my_tag))
Out[26]: ['data', 'country', 'neighbor']
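The generator above looks only at the children and grandchildren of each node it visits, so it relies on the target tag sitting at a fixed depth. A minimal recursive sketch that returns the tag chain at any depth is shown below. The function name find_tag_path and the inline sample XML are assumptions made for illustration, and the sample uses the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Inline sample (an assumption) with one extra level between country and neighbor.
SAMPLE = """<data>
  <country name="Liechtenstein">
    <attributes>
      <rank>1</rank>
      <neighbor name="Austria" direction="E"/>
    </attributes>
  </country>
</data>"""


def find_tag_path(node, my_tag, trail=()):
    """Return the list of tags from node down to the first my_tag, or None."""
    trail = trail + (node.tag,)
    if node.tag == my_tag:
        return list(trail)
    for child in node:
        found = find_tag_path(child, my_tag, trail)
        if found is not None:
            return found
    return None


root = ET.fromstring(SAMPLE)
path = find_tag_path(root, 'neighbor')
print('/' + '/'.join(path))
```

The search is depth-first and stops at the first match, so only one chain of ancestors is reported.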
Also, if you just want the full path, lxml's getpath will do that for you:
import lxml.etree as ET
tree = ET.parse('country.xml')
my_tag = 'neighbor'
print(tree.getpath(tree.find(".//neighbor")))
Output:
/data/country[1]/neighbor[1]
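For a file as large as the 8 GB one mentioned in the question, parsing the whole tree first may not be practical. A minimal streaming sketch with the standard library's iterparse is shown below; it keeps a stack of open tags and stops at the first match. The function name first_path_streaming and the in-memory sample are assumptions for illustration, and for a real file the file name would be passed instead of the BytesIO object.

```python
import io
import xml.etree.ElementTree as ET


def first_path_streaming(source, my_tag):
    """Return '/a/b/c' for the first element tagged my_tag, or None."""
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            stack.append(elem.tag)
            if elem.tag == my_tag:
                return '/' + '/'.join(stack)
        else:
            stack.pop()
            # Drop text and children of finished elements to limit memory use;
            # the emptied element itself stays attached to its parent.
            elem.clear()
    return None


# Small in-memory document (an assumption) standing in for the large file.
demo = io.BytesIO(
    b"<data><country><attributes><rank>1</rank>"
    b"<neighbor name='Austria'/></attributes></country></data>"
)
print(first_path_streaming(demo, 'neighbor'))
```

Because the function returns as soon as the tag is opened, only the part of the file before the first match is read.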
Upvotes: 1