Reputation: 166
I would like to take an HTML document and traverse only the <body> part of the document, together with its children. I see lots of examples that get a subtree via XPath or tag name, but these don't seem to give the children.
import lxml
from lxml import html, etree
html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = lxml.html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))
This will output
None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p
I would like only
page title /html/body/h1
some text /html/body/p
Any help gratefully received.
Upvotes: 1
Views: 1823
Reputation: 339
It seems that your HTML is malformed (unclosed tags and a mismatched </h3>). I wrote a little program with BeautifulSoup that you may be able to adapt for your purpose:
from bs4 import BeautifulSoup
html3 = "<html><head><title>test</title></head><body><h1>page title</h1><p>some text</p></body></html>"
soup = BeautifulSoup(html3, "html5lib")
body = soup.find('body')
for item in body.findChildren():
    print(item)
Output
<h1>page title</h1>
<p>some text</p>
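One thing worth knowing when adapting this: as far as I know, findChildren() is an older alias of find_all(), which walks all descendants by default. The sketch below contrasts that with find_all(recursive=False), which keeps only direct children. The nested <div> in the sample markup and the names html_doc, direct and all_desc are my own additions for illustration, and I use the built-in "html.parser" so html5lib is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the <div> wrapper is added so that
# "direct children" and "all descendants" give different results.
html_doc = (
    "<html><head><title>test</title></head>"
    "<body><div><h1>page title</h1></div><p>some text</p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")
body = soup.find("body")

# Only the tags that are immediate children of <body>
direct = [tag.name for tag in body.find_all(recursive=False)]

# Every tag nested anywhere below <body>
all_desc = [tag.name for tag in body.find_all()]

print(direct)    # ['div', 'p']
print(all_desc)  # ['div', 'h1', 'p']
```

Neither list includes <body> itself, so either form skips /html, /html/head and the title.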
Upvotes: 0
Reputation: 106
I had a similar difficulty, then I figured out that each etree node has an iterator you can use to traverse it and everything below it.
For instance, root here is the body element, and you can use it to iterate over each element inside the body:
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('yourdocument.html', parser)
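# Select the <body> element (an XPath ending in a trailing slash is invalid)
root = tree.xpath('/html/body')[0]
# iter() yields body itself first, then every descendant in document order
for i in root.iter():
    print(i.tag, i.text)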
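To get exactly the two lines asked for in the question, something like the following may work. It is only a sketch: it parses a well-formed version of the question's string instead of a file, and it uses iterdescendants(), which skips the starting element, so <body> itself is not reported. The name rows is just for illustration.

```python
from lxml import etree, html

# Well-formed version of the document from the question
html3 = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>some text</p></body></html>"
)

root = html.fromstring(html3)   # the <html> element
tree = etree.ElementTree(root)  # needed for getpath()
body = root.find("body")

# iterdescendants() walks everything below <body>, excluding <body> itself
rows = [(el.text, tree.getpath(el)) for el in body.iterdescendants()]

for text, path in rows:
    print(text, path)
```

If only the immediate children are wanted rather than all descendants, iterating over body directly (for el in body:) should do that.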
Upvotes: 2