transilvlad

Reputation: 14532

xml.etree.ElementTree get node depth

The XML:

<?xml version="1.0"?>
<pages>
    <page>
        <url>http://example.com/Labs</url>
        <title>Labs</title>
        <subpages>
            <page>
                <url>http://example.com/Labs/Email</url>
                <title>Email</title>
                <subpages>
                    <page>
                        <url>http://example.com/Labs/Email/How_to</url>
                        <title>How-To</title>
                    </page>
                </subpages>
            </page>
            <page>
                <url>http://example.com/Labs/Social</url>
                <title>Social</title>
            </page>
        </subpages>
    </page>
    <page>
        <url>http://example.com/Tests</url>
        <title>Tests</title>
        <subpages>
            <page>
                <url>http://example.com/Tests/Email</url>
                <title>Email</title>
                <subpages>
                    <page>
                        <url>http://example.com/Tests/Email/How_to</url>
                        <title>How-To</title>
                    </page>
                </subpages>
            </page>
            <page>
                <url>http://example.com/Tests/Social</url>
                <title>Social</title>
            </page>
        </subpages>
    </page>
</pages>

The code:

# rexml is the XML string read from a URL
from xml.etree import ElementTree as ET

tree = ET.fromstring(rexml)
for node in tree.iter('page'):
    for url in node.iterfind('url'):
        print(url.text)
    for title in node.iterfind('title'):
        print(title.text)
    print('-' * 30)

The output:

http://example.com/Labs
Labs
------------------------------
http://example.com/Labs/Email
Email
------------------------------
http://example.com/Labs/Email/How_to
How-To
------------------------------
http://example.com/Labs/Social
Social
------------------------------
http://example.com/Tests
Tests
------------------------------
http://example.com/Tests/Email
Email
------------------------------
http://example.com/Tests/Email/How_to
How-To
------------------------------
http://example.com/Tests/Social
Social
------------------------------

The XML represents a tree-like structure of a sitemap.

I have been up and down the docs and Google all day and can't figure out how to get the node depth of the entries.

I tried counting the children containers, but that only works for the first parent and then it breaks, because I can't figure out how to reset the count. It is probably a hackish idea anyway.

The desired output:

0
http://example.com/Labs
Labs
------------------------------
1
http://example.com/Labs/Email
Email
------------------------------
2
http://example.com/Labs/Email/How_to
How-To
------------------------------
1
http://example.com/Labs/Social
Social
------------------------------
0
http://example.com/Tests
Tests
------------------------------
1
http://example.com/Tests/Email
Email
------------------------------
2
http://example.com/Tests/Email/How_to
How-To
------------------------------
1
http://example.com/Tests/Social
Social
------------------------------

Upvotes: 17

Views: 20326

Answers (7)

JGFMK

Reputation: 8904

  1. I think it's far easier to count the number of '/' characters in the XPath.
  2. Basically, get the element's XPath and remove every character that is not a '/' with a simple regex.
  3. Then return the length of what is left.

import re
from lxml import etree  # getroottree()/getpath()/sourceline are lxml features

class Base_Node(object):
    def __init__(self, element: etree.Element, index: int):
        self.element = element
        self.index = index
        self._d = {}
        # Store the element's attributes with lower-cased names
        for name, value in self.element.items():
            self._d[name.lower()] = value

    def __str__(self) -> str:
        return f'tag: {self.tag} path: {self._path} depth: {self.depth}'

    @property
    def tag(self) -> str:
        return self.element.tag

    @property
    def _path(self) -> str:
        # Absolute XPath of the element, e.g. /pages/page/subpages/page
        return self.element.getroottree().getpath(self.element)

    @property
    def depth(self) -> int:
        # Depth = number of '/' characters in the XPath
        return len(re.sub('[^/]', '', self._path))

    @property
    def sourceline(self) -> int:
        return self.element.sourceline
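
A minimal usage sketch, reusing the etree import above and assuming lxml is installed and rexml holds the question's XML string:

root = etree.fromstring(rexml.encode('utf-8'))
for i, el in enumerate(root.iter('page')):
    # Note: this path-based depth counts every ancestor, including <subpages>
    print(Base_Node(el, i))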

Upvotes: 0

Tony Valero

Reputation: 11

My approach: a recursive function that lists nodes together with their level. You must first set the initial depth of the node you are passing in:

# Definition of the recursive function
def listchildrens(node, depth):
    # Print the node, indented by its depth
    print(" " * depth, "Type", node.tag, "Attributes", node.attrib, "Depth", depth)
    # If the node has children, recurse into them with an increased depth
    if len(node) > 0:
        depth += 1
        for child in node:
            listchildrens(child, depth)

# Define the starting depth
startdepth = 1
# Call the function with the XML body (root element) and the starting depth
listchildrens(xmlBody, startdepth)
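
The snippet does not show where xmlBody comes from; presumably it is the parsed root element, along these lines (assuming rexml holds the question's XML string):

import xml.etree.ElementTree as ET
xmlBody = ET.fromstring(rexml)  # root element handed to listchildrens()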

Upvotes: 1

Rachit

Reputation: 61

import xml.etree.ElementTree as etree

tree = etree.ElementTree(etree.fromstring(rexml))
maxdepth = 0

def depth(elem, level):
    """Recursively compute the maximum depth of the tree."""
    global maxdepth
    if level == maxdepth:
        maxdepth += 1
    # Recursive call for every child, one level deeper
    for child in elem:
        depth(child, level + 1)

depth(tree.getroot(), -1)
print(maxdepth)

Upvotes: 6

Darkstar Dream

Reputation: 1859

This is another easy way of doing it without an XML library: read the pretty-printed XML line by line and count the indentation (the first input line gives the number of lines that follow):

depth = 0
# The first input line is the number of XML lines that follow
for i in range(int(input())):
    # Each group of four spaces counts as one level of indentation
    tab = input().count('    ')
    if tab > depth:
        depth = tab
print(depth)
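
The same indentation-counting idea can be applied to a string such as the question's rexml directly, assuming the XML is pretty-printed with four spaces per nesting level (a rough sketch, not a real parser):

max_depth = 0
for line in rexml.splitlines():
    stripped = line.lstrip(' ')
    if stripped:  # skip blank lines
        max_depth = max(max_depth, (len(line) - len(stripped)) // 4)
print(max_depth)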

Upvotes: -2

maxschlepzig

Reputation: 39205

The Python ElementTree API provides iterators for depth-first traversal of an XML tree - unfortunately, those iterators don't provide any depth information to the caller.

But you can write a depth-first iterator that also returns the depth information for each element:

import xml.etree.ElementTree as ET

def depth_iter(element, tag=None):
    stack = [iter([element])]
    while stack:
        e = next(stack[-1], None)
        if e is None:
            stack.pop()
        else:
            stack.append(iter(e))
            if tag is None or e.tag == tag:
                # The stack height (minus the seed iterator) is the element's depth
                yield (e, len(stack) - 1)

Note that this is more efficient than determining the depth by following the parent links (as lxml allows) - i.e. it is O(n) vs. O(n log n).
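
For example, it could be driven like this for the question's data (assuming rexml holds the XML string; the yielded depth counts the root element as 1, and since <page> elements nest through <subpages>, top-level pages come back as 2, nested ones as 4, and so on):

root = ET.fromstring(rexml)
for page, d in depth_iter(root, 'page'):
    print((d - 2) // 2)            # re-base so top-level <page> elements are 0
    print(page.findtext('url'))
    print(page.findtext('title'))
    print('-' * 30)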

Upvotes: 12

Txema

Reputation: 857

lxml is best for this, but if you have to use the standard library, don't use iter; walk the tree yourself, so you always know where you are.

from xml.etree import ElementTree as ET
tree = ET.fromstring(rexml)

def sub(node, tag):
    return node.findall(tag) or []

def print_page(node, depth):
    print(depth)
    url = node.find("url")
    if url is not None:
        print(url.text)
    title = node.find("title")
    if title is not None:
        print(title.text)
    print('-' * 30)

def find_pages(node, depth=0):
    for page in sub(node, "page"):
        print_page(page, depth)
        subpage = page.find("subpages")
        if subpage is not None:
            find_pages(subpage, depth+1)

find_pages(tree)

Upvotes: 0

falsetru

Reputation: 369424

Using lxml.html:

import lxml.html

rexml = ...

def depth(node):
    # Count the node itself and every ancestor reachable via getparent()
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    return d

tree = lxml.html.fromstring(rexml)
for node in tree.iter('page'):
    print(depth(node))
    for url in node.iterfind('url'):
        print(url.text)
    for title in node.iterfind('title'):
        print(title.text)
    print('-' * 30)
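
getparent() is an lxml feature; with the standard library's ElementTree a similar walk-up is possible once you build a child-to-parent map first. A sketch (parent and et_depth are illustrative names, rexml as above):

from xml.etree import ElementTree as ET

root = ET.fromstring(rexml)
parent = {child: p for p in root.iter() for child in p}  # child -> parent map

def et_depth(node):
    # Depth below the root element: <pages> is 0, a top-level <page> is 1, etc.
    d = 0
    while node in parent:  # the root has no entry in the map
        d += 1
        node = parent[node]
    return d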

Upvotes: 4
