Parse html with lxml (tag h3)

Question

I'm trying to parse some HTML and I'm having a problem with this small HTML snippet.

XML:


    
    Other
    Other

    Indice

code:

    import urllib
    from lxml import etree
    import StringIO

    resultado = urllib.urlopen('trozo.html')
    html = resultado.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath = '/div/h3'
    html_filtrado = tree.xpath(xpath)
    print html_filtrado

When I print the result I get [], but I expected a list with

`Other`

in it. If I had that list, I would call etree.tostring(html_filtrado[0]) to see Other.

So how can I get this code?

Other

Or only ../url, which is the part I actually want?

Thank you

Pavel Shvedov · Accepted Answer

The issue is that when etree.HTMLParser() receives an HTML fragment, it builds a full HTML document tree, wrapping the fragment in html and body elements. So if you call etree.tostring(tree), instead of what you intended you get




Other
Other
Indice

So the correct XPath would be '/html/body/div/h3'.
