dani

Reputation: 341

Parse html with lxml (tag h3)

I'm trying to parse some HTML and I'm having a problem with this small snippet.

HTML:

<div>
    <p><span><a href="../url"></a></span></p>
    <h3 class="header"><a href="../url">Other</a></h3>
    <a href="../url">Other</a><br>
    <a class="aaaaa" href="../url">Indice</a>
    <p></p>               
</div>

code:

import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado

When I print the result I get [], but I expected a list containing <h3 class="header"><a href="../url">Other</a></h3>. If I had that list, I would call etree.tostring(html_filtrado) to see <h3 class="header"><a href="../url">Other</a></h3>.

So how can I get this element?

<h3 class="header"><a href="../url">Other</a></h3>

Or just ../url, which is the part I actually want?

Thank you

Upvotes: 1

Views: 2547

Answers (2)

ekhumoro

Reputation: 120778

The XPath query in your example is not quite right: '/div/h3' is an absolute path, so it only matches when div is the root element of the document.

To get a list of all h3 tags within div tags, you should use this:

elements = tree.xpath('//div/h3')
etree.tostring(elements[0])

Which should give:

'<h3 class="header"><a href="../url">Other</a></h3>\n'

To get a list of all href attributes of a tags within h3 tags, you could use something like this:

tree.xpath('//h3/a/@href')

Which gives:

['../url']

Upvotes: 4

Pavel Shvedov

Reputation: 1314

The point is that when etree.HTMLParser() receives HTML, it builds the full HTML DOM tree, adding the html and body elements that your snippet lacks. So, instead of what you intended, etree.tostring(tree) gives you

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p><span><a href="../url"/></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br/><a class="aaaaa" href="../url">Indice</a>
<p/>               

So the correct absolute XPath would be '/html/body/div/h3'.
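A quick way to check this is to compare the two absolute paths on a cut-down version of the markup. This is only a sketch, assuming Python 3 with lxml installed; FRAGMENT, no_match and match are illustrative names, not part of the original code.

```python
from lxml import etree

# A shortened fragment with the same structure as the question's snippet
# (an assumption made to keep the example small).
FRAGMENT = '<div><h3 class="header"><a href="../url">Other</a></h3></div>'

root = etree.fromstring(FRAGMENT, etree.HTMLParser())

# The parser wraps the fragment, so the root element is html, not div.
root_tag = root.tag

# The absolute path from the question finds nothing...
no_match = root.xpath('/div/h3')

# ...while the path that includes the added wrappers finds the heading.
match = root.xpath('/html/body/div/h3')

print(root_tag, len(no_match), len(match))
```

If you do not want to depend on the wrappers, a relative search such as '//div/h3' avoids spelling out the html and body steps.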

Upvotes: 3
