How to get text of broken html with lxml

Question

Here's what I have:

r = requests.get("http://www.cnn.com/")
htmlelement = lxml.html.fromstring(r.text)
html = lxml.html.tostring(htmlelement)
tree = lxml.etree.fromstring(html)
print tree.xpath('//*[@id="cnn_maintt1imgbul"]/div/div[2]/div/h1/a')

I thought lxml.html corrected the broken HTML?

The error is:

XMLSyntaxError: Opening and ending tag mismatch: link line 32 and head, line 75, column 8

Thanks!

larsks · Accepted Answer

I don't understand why you're trying to reparse the content after getting this far:

>>> htmlelement = lxml.html.fromstring(r.text)

Because at this point you can simply apply your xpath expression:

>>> results = htmlelement.xpath('//*[@id="cnn_maintt1imgbul"]/div/div[2]/div/h1/a')
>>> results
[<Element a at 0x...>]
>>> print lxml.html.tostring(results[0])
'SOUTH KOREAN PRIME MINISTER RESIGNS'

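I believe your problem is that lxml.html.tostring() still generates HTML, not XML. HTML allows tags such as <link> to stay unclosed, so when you pass that output to lxml.etree.fromstring(), the strict XML parser reports a tag mismatch.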
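
To make the difference concrete, here is a minimal sketch using Python 3 syntax (print() as a function, unlike the Python 2 code in the question). The BROKEN_HTML string, the "headline" id, and the XPath are made-up stand-ins for the CNN page, not its real markup. The sketch assumes you only want the link text, which text_content() returns without any tags.

```python
import lxml.etree
import lxml.html

# Hypothetical stand-in for the downloaded page: <link> is never closed,
# which is fine in HTML but not well-formed XML.
BROKEN_HTML = """
<html>
  <head>
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
    <div id="headline"><h1><a href="/story">Example headline</a></h1></div>
  </body>
</html>
"""

# The HTML parser tolerates the unclosed tag and returns a normal element tree,
# so XPath can be applied to it directly.
root = lxml.html.fromstring(BROKEN_HTML)
links = root.xpath('//*[@id="headline"]/h1/a')

# text_content() gives the text of the element and its descendants, without markup.
headline = links[0].text_content()
print(headline)

# Serializing back to HTML keeps <link> unclosed, so the strict XML parser
# rejects it, which is the situation described in the question.
serialized = lxml.html.tostring(root)
try:
    lxml.etree.fromstring(serialized)
    xml_parse_failed = False
except lxml.etree.XMLSyntaxError:
    xml_parse_failed = True
print(xml_parse_failed)
```

In other words, stay with the tree that lxml.html.fromstring() gives you and there is no second parse to go wrong.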
