Reputation: 2256
I tried many different solutions for the following problem and I couldn't find one that works at the time being. I need to get some information from meta tags in several webpages. For this purpose I found lxml very useful because I also need to find specific content using xpath to parse it. XPath works on the tree, however, I have a 20% of websites (in a total around 100) that don't work, specifically head seems to be broken.
tree = html.fromstring(htmlfrompage) // using html from lxml package
head_object = tree.head // access to head object from this webpage
In all of these websites accessing head object (which is only a shortcut to xpath) fails with the same error:
print tree.head
IndexError: list index out of range
Because the following xpath fails:
self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
This xpath is empty so accessing the first element fails. I was navigating the tree myself and self.xpath('//head') or self.xpath('//html/head') or even self.xpath('//body') is empty. But if I try to access meta tags directly in any place of the document:
head = tree.xpath("//meta")
for meta_tag in head:
print meta_tag.text # Just printing something
It works, so it means somehow metas are not connected to the head, but they're somewhere floating in the tree. Head doesn't exist anyway. Of course I can try to "patch" this issue accessing head and in case I get an index out of range exception I could navigate metas to find what I'm looking for but I expected lxml fixes broken html (as I read in the documentation).
Is there anybody that had the same issue and could solve it in a better way?
Upvotes: 1
Views: 484
Reputation: 1123042
Using requests
I can load the tree just fine:
>>> import requests
>>> from lxml import html
>>> r = requests.get('http://www.lanacion.com.ar/1694725-ciccone-manana-debera-declarar-carosso-donatiello-el-inquilino-de-boudou')
>>> tree = html.fromstring(r.content)
>>> tree.head
<Element head at 0x10681b4c8>
Do note that you want to pass a byte string to html.fromstring()
; don't use r.text
as that'll pass in Unicode instead.
Moreover, if the server did not indicate the encoding in the headers, requests
falls back to the HTTP RFC default, which is ISO-8859-1 for text/
responses. For this specific response that is incorrect:
>>> r.headers['Content-Type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding # make an educated guess
'utf-8'
This means r.text
will use Latin-1 to decode the UTF-8 data, leading to an incorrectly decoded Unicode string, further confusing matters.
The HTML parser, on the other hand, can make use of the <meta>
header present to tell it what encoding to use:
>>> tree.find('.//meta').attrib
{'content': 'text/html; charset=utf-8', 'http-equiv': 'Content-Type'}
Upvotes: 1