How do I use lxml and Python to traverse the <body> of an HTML document along with its children

Question

I would like to take an HTML document and traverse the <body> part of the document along with its children. I see lots of examples that get a subtree via XPath or tag name, but these don't seem to give the children.

from lxml import html, etree

html3 = "<html><head><title>test</title><body><h1>page title</h1><p>some text</p>"
root = html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))

This will output:

None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p

I would like only:

page title /html/body/h1
some text /html/body/p

Any help gratefully received.

Manjit Ullal · Accepted Answer

I had a similar difficulty, then I figured out that each etree node has an iterator over itself and its descendants, which you can use to traverse the subtree.

For instance, root here is the <body> element; using it, you can iterate over each element inside the body.

from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse('yourdocument.html', parser)

# Select the <body> element
root = tree.xpath('/html/body')[0]
# iter() yields <body> itself first, then every descendant in document order
for i in root.iter():
    print(i.tag, i.text)
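
If you want exactly the output asked for in the question (the children of <body> but not <body> itself), something like the following might work. This is only a sketch: the DOC string is modeled on the question's sample rather than taken from a real file, and body_descendants is a made-up helper name, not part of lxml.

```python
from lxml import etree, html

# Sample markup modeled on the question's example (an assumption, not a real file).
DOC = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>some text</p></body></html>"
)


def body_descendants(markup):
    """Return (text, path) pairs for every element below <body>.

    Hypothetical helper for illustration only.
    """
    root = html.fromstring(markup)
    tree = etree.ElementTree(root)
    body = root.find("body")
    pairs = []
    # iterdescendants() walks everything under <body> but skips <body> itself.
    for el in body.iterdescendants():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            pairs.append((el.text, tree.getpath(el)))
    return pairs


for text, path in body_descendants(DOC):
    print(text, path)
```

The difference from iter() is that iterdescendants() leaves out the starting node, so the "None /html/body" line does not appear.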
