Reputation: 341
I'm trying to parse some HTML and I'm having a problem with this small snippet.
HTML:
<div>
<p><span><a href="../url"></a></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br>
<a class="aaaaa" href="../url">Indice</a>
<p></p>
</div>
Code:
import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado
When I print the result I get [], but I expected a list containing <h3 class="header"><a href="../url">Other</a></h3>.
If I had that list, I would call etree.tostring(html_filtrado) to see <h3 class="header"><a href="../url">Other</a></h3>.
So how can I get this element?
<h3 class="header"><a href="../url">Other</a></h3>
Or just ../url, which is the part I actually want?
Thank you
Upvotes: 1
Views: 2547
Reputation: 120778
The XPath query in your example is not quite right.
To get a list of all h3 tags within div tags, you should use this:
elements = tree.xpath('//div/h3')
etree.tostring(elements[0])
Which should give:
'<h3 class="header"><a href="../url">Other</a></h3>\n'
To get a list of all href attributes of a tags within h3 tags, you could use something like this:
tree.xpath('//h3/a/@href')
Which gives:
['../url']
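For reference, here is a minimal self-contained sketch of both queries. It assumes Python 3 and inlines the snippet from the question as a string (the HTML constant and the variable names are only for illustration) instead of reading trozo.html:

```python
from lxml import etree

# The snippet from the question, inlined so the example runs on its own.
HTML = """<div>
<p><span><a href="../url"></a></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br>
<a class="aaaaa" href="../url">Indice</a>
<p></p>
</div>"""

# etree.HTML parses the string with lxml's HTML parser and returns the root element.
root = etree.HTML(HTML)

# '//' searches at any depth, so the wrapper elements the parser adds do not matter.
headers = root.xpath('//div/h3')
hrefs = root.xpath('//h3/a/@href')

# tostring works on a single element, not on the list returned by xpath.
print(etree.tostring(headers[0], encoding='unicode'))
print(hrefs)  # ['../url']
```

Note that xpath always returns a list, so index into it (or loop over it) before calling etree.tostring.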
Upvotes: 4
Reputation: 1314
The thing is that when etree.HTMLParser() receives an HTML fragment, it builds a full HTML document tree around it, adding the html and body elements. So, instead of what you intended, if you use etree.tostring(tree) you get:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p><span><a href="../url"/></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br/><a class="aaaaa" href="../url">Indice</a>
<p/>
So the correct absolute XPath would be '/html/body/div/h3'.
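A quick way to check this is to compare the two absolute paths on a shortened version of the snippet. This is only a sketch, assuming Python 3 and an inlined string rather than the original file:

```python
from lxml import etree

# A shortened version of the question's snippet; note there is no <html> or <body> in it.
root = etree.HTML('<div><h3 class="header"><a href="../url">Other</a></h3></div>')

# The HTML parser wraps the fragment, so the root element is <html>, not <div>.
print(root.tag)  # html

# An absolute path starting at /div finds nothing, because div is not the document root.
print(root.xpath('/div/h3'))  # []

# Spelling out the wrapper elements makes the absolute path match.
matches = root.xpath('/html/body/div/h3')
print(matches[0].get('class'))  # header
print(root.xpath('/html/body/div/h3/a/@href'))  # ['../url']
```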
Upvotes: 3