lxml tree head and some other elements broken

Question

I have tried many different solutions to the following problem and have not found one that works so far. I need to extract information from the meta tags of several web pages. lxml is very useful for this because I also need XPath to locate specific content before parsing it. XPath works on the tree, but about 20% of the websites (out of roughly 100) fail; specifically, the head seems to be broken.

from lxml import html

tree = html.fromstring(htmlfrompage)  # htmlfrompage holds the downloaded page source
head_object = tree.head               # shortcut for the document's <head> element

On all of these websites, accessing the head object (which is only a shortcut for an XPath query) fails with the same error:

print(tree.head)
IndexError: list index out of range

That happens because the following XPath lookup, which lxml uses to implement head, fails:

self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]

The XPath result is empty, so indexing its first element fails. Navigating the tree myself, self.xpath('//head'), self.xpath('//html/head') and even self.xpath('//body') all come back empty. But if I look for meta tags directly, anywhere in the document:

head = tree.xpath("//meta")
for meta_tag in head:
    print(meta_tag.text)  # Just printing something

It works, which means the meta tags are not attached to a head element; they are floating somewhere else in the tree, and the head does not exist at all. Of course I could "patch" this by accessing head and, if I get an index-out-of-range exception, walking the meta tags to find what I'm looking for, but I expected lxml to fix broken HTML (as I read in the documentation).
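A minimal sketch of that fallback, assuming Python 3. The helper name collect_meta is hypothetical and not part of lxml. The question's lxml raises IndexError when the head is missing; the sketch also tolerates a None result, on the assumption that other versions may report a missing head that way.

```python
from lxml import etree, html


def collect_meta(tree):
    """Return {name: content} for <meta name=... content=...> tags.

    Hypothetical helper, not part of lxml.
    """
    try:
        head = tree.head  # some lxml versions raise IndexError without a <head>
    except IndexError:
        head = None       # treat the exception like a "no head" result

    # Prefer metas under <head>; otherwise search the whole document.
    metas = head.iter("meta") if head is not None else tree.xpath("//meta")
    return {m.get("name"): m.get("content") for m in metas if m.get("name")}


# A normal document: <head> exists and holds the meta tag.
good = html.fromstring(
    b'<html><head><meta name="description" content="hello"></head>'
    b'<body><p>hi</p></body></html>'
)

# A hand-built tree with no <head> at all, to exercise the fallback.
headless = html.Element("div")
etree.SubElement(headless, "meta", name="description", content="orphan")

print(collect_meta(good))
print(collect_meta(headless))
```

This only hides the symptom; it does not explain why the head is missing in the first place.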

Has anybody had the same issue and solved it in a better way?

Martijn Pieters · Accepted Answer

Using requests I can load the tree just fine:

>>> import requests
>>> from lxml import html
>>> r = requests.get('http://www.lanacion.com.ar/1694725-ciccone-manana-debera-declarar-carosso-donatiello-el-inquilino-de-boudou')
>>> tree = html.fromstring(r.content)
>>> tree.head

Do note that you want to pass a byte string to html.fromstring(); don't use r.text as that'll pass in Unicode instead.

Moreover, if the server did not indicate the encoding in the headers, requests falls back to the HTTP RFC default, which is ISO-8859-1 for text/* responses. For this specific response that is incorrect:

>>> r.headers['Content-Type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding  # make an educated guess
'utf-8'

This means r.text will use Latin-1 to decode the UTF-8 data, leading to an incorrectly decoded Unicode string, further confusing matters.
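The effect can be reproduced without any network access. The sketch below uses only the standard library and a made-up sample word (not taken from the page in question) to show what decoding UTF-8 bytes as Latin-1 does to a non-ASCII character:

```python
# Bytes as a UTF-8 encoded page would deliver them (sample text is illustrative).
raw = "ma\u00f1ana".encode("utf-8")   # the word "mañana"

# Decoding with the wrong codec never fails, because every byte is valid
# Latin-1; it silently turns each multi-byte character into two characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the bytes were written in restores the text.
right = raw.decode("utf-8")

print(repr(wrong), len(wrong))
print(repr(right), len(right))
```

Because the wrong decode raises no error, the damage only shows up later, in the parsed text.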

The HTML parser, on the other hand, can make use of the Content-Type declaration present in the document's own meta tag to tell it what encoding to use:

>>> tree.find('.//meta').attrib
{'content': 'text/html; charset=utf-8', 'http-equiv': 'Content-Type'}
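As a rough offline illustration, assuming Python 3 and an invented page body rather than the real response, passing bytes lets the parser pick up the declared charset on its own:

```python
from lxml import html

# Invented sample page: UTF-8 bytes with the charset declared in a meta tag.
page = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b'<title>Ma\xc3\xb1ana</title>'
    b'</head><body><p>hola</p></body></html>'
)

# Hand the raw bytes to lxml (the equivalent of r.content, not r.text),
# so the parser does the decoding using the declared charset.
tree = html.fromstring(page)

head = tree.head
title = tree.findtext('.//title')

print(head.tag)
print(repr(title))
```

With requests, the same idea means feeding r.content to html.fromstring() and leaving the decoding to the parser.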
