fleaheap

Reputation: 166

How do I use lxml and Python to traverse the <body> of an HTML document along with its children?

I would like to take an HTML document and traverse the <body> part of the document along with its children. I see lots of examples that get a subtree via XPath or tag name, but these don't seem to give the children.

import lxml
from lxml import html, etree  

html3 = "<html><head><title>test<body><h1>page title</h3><p>some text</p>"
root = lxml.html.fromstring(html3)
tree = etree.ElementTree(root)
for el in root.iter():
    # do something
    print(el.text, tree.getpath(el))

This will output

None /html
None /html/head
test /html/head/title
None /html/body
page title /html/body/h1
some text /html/body/p

I would like only

page title /html/body/h1
some text /html/body/p

Any help gratefully received.

Upvotes: 1

Views: 1823

Answers (2)

Alex Lee

Reputation: 339

It seems that your HTML is malformed. I wrote a small program with BeautifulSoup that you may be able to adapt for your purpose:

from bs4 import BeautifulSoup
html3 = "<html><head><title>test</title></head><body><h1>page title</h1><p>some text</p></body></html>"
soup = BeautifulSoup(html3, "html5lib")
body = soup.find('body')

for item in body.findChildren():
    print(item)

Output

<h1>page title</h1>
<p>some text</p>

Upvotes: 0

Manjit Ullal

Reputation: 106

I had a similar difficulty, then I figured out that each etree node has an iterator over itself and its descendants, which you can use to traverse the subtree.

For instance, root here is the <body> element, and you can use it to iterate over each element of the body:

from lxml import etree
parser = etree.HTMLParser()
tree   = etree.parse('yourdocument.html', parser)

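# An XPath expression must not end with a trailing slash
root = tree.xpath('/html/body')[0]
# iter() replaces the deprecated getiterator(); it yields <body> first, then its descendants
for i in root.iter():
    print(i.tag, i.text)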
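
To get only the elements inside <body>, without <body> itself, one option is iterdescendants(). The following is a minimal sketch rather than a definitive solution. It assumes the well-formed version of the question's HTML string (with the title, head, and h1 tags closed), since how a parser repairs the malformed original may vary; the name results is only for illustration.

```python
from lxml import etree, html

# Assumption: a well-formed variant of the question's snippet.
html3 = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>some text</p></body></html>"
)
root = html.fromstring(html3)   # a full document, so this is the <html> element
tree = etree.ElementTree(root)  # needed for getpath()

body = root.find("body")        # <body> is a direct child of <html>

# iterdescendants() walks everything below <body> but skips <body> itself.
# The "*" tag filter restricts the walk to element nodes (no comments).
results = [(el.text, tree.getpath(el)) for el in body.iterdescendants("*")]

for text, path in results:
    print(text, path)
```

If only the direct children are wanted rather than all nested descendants, iterating over body itself (for el in body:) should be enough.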

Upvotes: 2
