E Bobrov

Reputation: 135

How to produce the XML structure up to a certain XML node using Python?

I am new to this. My original XML file is about 8 GB, so it is hard to manually trace all the parents, grandparents, great-grandparents, etc. of the child I am interested in. I am trying to walk through the nodes until that child is found, so I want to create a "skeleton" structure of the XML down to that child. As an example I use country_data.xml from https://docs.python.org/2/library/xml.etree.elementtree.html. Sorry for the code:

def LookThrougStructure(parent, xpath_str, stop_flag):
    out_str.write('Parent tag: %s\n' % (parent.tag))
    for child in parent:
        if child.tag == my_tag:
            out_str.write('Child tag: %s\n' % (child.tag))
            #my_node_is_found_flag = 1
            break
        LookThrougStructure(child, child.tag, 0)
    return  
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
my_tag = 'neighbor'
out_str = open('xml_structure.txt', 'w')
LookThrougStructure(root, root.tag, my_tag)
out_str.close()

It does not work as intended and yields all node tags:

Parent tag: data
Parent tag: country
Parent tag: rank
Parent tag: year
Parent tag: gdppc
Child tag: neighbor
Parent tag: country
Parent tag: rank
Parent tag: year
Parent tag: gdppc
Child tag: neighbor
Parent tag: country
Parent tag: rank
Parent tag: year
Parent tag: gdppc
Child tag: neighbor

But I want something like this (the child I am interested in is "neighbor"):

Or this: /data/country/neighbor. What is wrong?

Upvotes: 1

Views: 96

Answers (2)

E Bobrov

Reputation: 135

@Padraic, thanks a lot! Your code is mostly what I want. But if I insert an additional node (for example, attributes) that is a child of the country node and the parent of the neighbor node, it gives unexpected results:

<data>
<country name="Liechtenstein">
<attributes>
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
    </attributes>
</country>
<country name="Singapore">
<attributes>
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N"/>
    </attributes>
</country>
<country name="Panama">
<attributes>
    <rank>68</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
    </attributes>
</country>
</data>

Anyway, your help was very fruitful. I took your code and created this:

import lxml.etree as et
root = et.parse('country_data.xml')

out_f = open('getpath.txt', 'w')

my_str1 = 'country[1]'
my_str2 = 'neighbor[1]'

for e in root.iter():
    s = root.getelementpath(e)
    if my_str1 not in s:
        continue
    if my_str2 not in s:
        continue
    out_f.write('%s\n' %(s))
    break
out_f.close()

The idea is simple: if the element path contains both the strings 'country[1]' and 'neighbor[1]', it is written to the output file. For the original XML example it gives country[1]/neighbor[1], and for the XML with the additional parent it gives country[1]/attributes/neighbor[1].
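The same idea can also be sketched with only the standard library, for the case where the target sits at an unknown depth. This is a minimal sketch, not the code from either answer: the helper name find_path and the inline SAMPLE document are made up for illustration, and it assumes the tree fits in memory.

```python
import xml.etree.ElementTree as ET


def find_path(node, target_tag, trail=()):
    """Return the list of tags from `node` down to the first element
    whose tag equals `target_tag`, or None if there is no such element."""
    trail = trail + (node.tag,)
    if node.tag == target_tag:
        return list(trail)
    # Depth-first search: stop at the first branch that contains the target.
    for child in node:
        found = find_path(child, target_tag, trail)
        if found is not None:
            return found
    return None


# Hypothetical, shortened sample with the extra <attributes> level.
SAMPLE = """<data>
  <country name="Liechtenstein">
    <attributes>
      <rank>1</rank>
      <neighbor name="Austria" direction="E"/>
    </attributes>
  </country>
</data>"""

root = ET.fromstring(SAMPLE)
path = find_path(root, "neighbor")
skeleton = "/" + "/".join(path)
print(skeleton)
```

Because the recursion carries the trail of ancestor tags with it, the extra attributes level shows up in the result without any special handling.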

Upvotes: 1

Padraic Cunningham

Reputation: 180391

If I understand you correctly you want something like:

def look_through_structure(parent, my_tag):
    for node in parent.iter("*"):
        out_str.write('Parent tag: %s\n' % node.tag)
        for nxt in node:
            if nxt.tag == my_tag:
                out_str.write('child tag: %s\n' % my_tag)
                return
            out_str.write('Parent tag: %s\n' % nxt.tag)
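            # Iterating an element yields its direct children; getchildren()
            # was removed from xml.etree.ElementTree in Python 3.9.
            if any(ch.tag == my_tag for ch in nxt):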
                out_str.write('child tag: %s\n' % my_tag)
                return

If we change the function a bit and yield the tags:

def look_through_structure(parent, my_tag):
    for node in parent.iter("*"):
        yield node.tag
        for nxt in node:
            if nxt.tag == my_tag:
                yield nxt.tag
                return
            yield nxt.tag
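            # Iterating an element yields its direct children; getchildren()
            # was removed from xml.etree.ElementTree in Python 3.9.
            if any(ch.tag == my_tag for ch in nxt):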
                yield my_tag
                return

And run it on the file:

In [24]: root = tree.getroot()

In [25]: my_tag = 'neighbor'

In [26]: list(look_through_structure(root, my_tag))
Out[26]: ['data', 'country', 'neighbor']

Also if you just wanted the full path, lxml's getpath would do that for you:

import lxml.etree as ET

tree = ET.parse('country_data.xml')

my_tag = 'neighbor'

# find() returns the first matching descendant; getpath() builds its absolute XPath
print(tree.getpath(tree.find(".//" + my_tag)))

Output:

/data/country[1]/neighbor[1]
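For a file as large as the 8 GB one mentioned in the question, parsing the whole tree first may not be practical. A different technique, sketched here as an assumption rather than as part of the approach above, is to stream the file with the standard library's iterparse and keep a stack of open tags, stopping at the first match. The function name first_path_streaming and the in-memory sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def first_path_streaming(source, target_tag):
    """Stream `source` (a filename or binary file object) and return the
    slash-separated tag path of the first `target_tag` element, or None."""
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if elem.tag == target_tag:
                # Stop reading as soon as the first match is opened.
                return "/" + "/".join(stack)
        else:
            stack.pop()
            # Drop the finished element's text and children to limit memory use.
            elem.clear()
    return None


# Hypothetical in-memory sample standing in for country_data.xml.
sample = io.BytesIO(
    b"<data><country name='Panama'>"
    b"<rank>68</rank><neighbor name='Colombia' direction='E'/>"
    b"</country></data>"
)
streamed = first_path_streaming(sample, "neighbor")
print(streamed)
```

Unlike getpath, this sketch does not add positional indexes such as [1]; it only reports the chain of tag names.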

Upvotes: 1
