Michael

Reputation: 21

lxml - UnicodeDecodeError when accessing text of an element

I am working with some Python code that uses the lxml HTML parser to parse the HTML that a co-worker scraped from a random sample of web sites.

In two of them, I get an error of the form

"'utf8' codec can't decode byte 0xe20x80 in position 502: unexpected end of data",

and the HTML content does contain a corrupt UTF-8 character.

A variable in the code called ele is assigned to a <p> element surrounding the text with the bad character, and that text can be accessed via ele.text. Or it could be, but merely assigning ele.text to another variable causes the UnicodeDecodeError to be raised. The UnicodeDecodeError object that is available in the except clause contains some useful attributes, such as the start and end positions of the bad bytes in the text, which could be used to create a new string from which the bad bytes have been removed. But doing anything to ele.text, such as taking a substring of it, causes a new UnicodeDecodeError to be raised. Is there anything I can do to salvage the good parts of ele.text?

I am writing this from memory, and I don't remember all the details of the code, so I can supply more information tomorrow if it's useful. What I remember is that ele is an object of a type something like lxml._Element, the file being parsed really is in utf-8, and there is a place in the file where the first two utf-8 bytes of the character that matches the entity &rdquo; are followed by the entity &rdquo;. So the text contains "\xE2\x80&rdquo;". The error message complains about the "\xE2\x80" and gives their position in a string that has about 520 characters in it. I could discard the whole string if necessary, but I'd rather just use the position info to discard the "\xE2\x80". For some reason, doing anything with ele.text causes an error in lower level Cython code in lxml. I can provide the stack trace tomorrow when I am at work. What, if anything, can I do with that text? Thanks.

Upvotes: 2

Views: 2675

Answers (1)

jfs

Reputation: 414395

e2 80 bytes by themselves do not cause the error:

from lxml import html

html_data = b"<p>before &ldquo;\xe2\x80&rdquo; after"
p = html.fromstring(html_data)
print(repr(p.text))
# -> u'before \u201c\xe2\x80\u201d after'

As @Esailija pointed out in the comments, the above doesn't interpret the data as utf-8. To force utf-8, decode the bytes yourself (ignoring any that cannot be decoded) before passing the text to lxml:

from lxml import html

html_data = b"""<meta http-equiv="content-type"
                      content="text/html; charset=UTF-8">
                <p>before &ldquo;\xe2\x80&rdquo; after"""
doc = html.fromstring(html_data.decode('utf-8','ignore'))
print(repr(doc.find('.//p').text))
# -> u'before \u201c\u201d after'

More generally:
  • check that utf-8 is the correct character encoding for the document
  • replace the broken byte sequence before passing it to lxml
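
The second suggestion can also be done with the position information the question mentions. The following is only a sketch, assuming you still have the raw bytes of the page before lxml sees them; strip_invalid_utf8 is a hypothetical helper name, not part of lxml or the standard library. It uses the start and end attributes of UnicodeDecodeError to cut out just the offending bytes:

```python
def strip_invalid_utf8(data: bytes) -> str:
    """Decode data as UTF-8, dropping only the byte ranges that fail to decode.

    Hypothetical helper for illustration; not part of lxml.
    """
    while True:
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start and exc.end delimit the undecodable bytes;
            # a logging call could go here to record where they were.
            data = data[:exc.start] + data[exc.end:]


html_data = b"<p>before &ldquo;\xe2\x80&rdquo; after"
cleaned = strip_invalid_utf8(html_data)
print(repr(cleaned))
```

For this input the result should be the same as html_data.decode('utf-8', 'ignore'); the loop is only worth having if you want to log or inspect the positions of the bad bytes. The cleaned string can then be passed to html.fromstring() as in the example above.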

Upvotes: 1
