Reputation: 21
I am working with some Python code that uses the lxml HTML
parser to parse the HTML that a co-worker scraped from a random sample of web sites.
In two of them, I get an error of the form
"'utf8' codec can't decode bytes 0xe2 0x80 in position 502: unexpected end of data",
and the HTML content does contain a corrupt UTF-8 character.
A variable in the code called ele is assigned to a <p> element surrounding the text with the bad character, and that text can be accessed via ele.text. Or it could be, except that merely assigning ele.text to another variable causes the UnicodeDecodeError to be raised. The UnicodeDecodeError object available in the except clause contains some useful attributes, such as the start and end positions of the bad bytes in the text, which could be used to create a new string from which the bad bytes have been removed, but doing anything to ele.text, such as taking a substring of it, causes a new UnicodeDecodeError to be raised. Is there anything I can do to salvage the good parts of ele.text?
I am writing this from memory, and I don't remember all the details of the code, so I can supply more information tomorrow if it's useful. What I remember is that ele is an object of a type something like lxml._Element, the file being parsed really is in utf-8, and there is a place in the file where the first two utf-8 bytes of the character that matches the entity &rdquo; are followed by the entity &rdquo;. So the text contains "\xE2\x80&rdquo;". The error message complains about the "\xE2\x80" and gives their position in a string that has about 520 characters in it. I could discard the whole string if necessary, but I'd rather just use the position info to discard the "\xE2\x80". For some reason, doing anything with ele.text causes an error in lower-level Cython code in lxml. I can provide the stack trace tomorrow when I am at work. What, if anything, can I do with that text? Thanks.
Upvotes: 2
Views: 2675
Reputation: 414395
e2 80 bytes by themselves do not cause the error:
from lxml import html
html_data = b"<p>before “\xe2\x80” after"
p = html.fromstring(html_data)
print(repr(p.text))
# -> u'before \u201c\xe2\x80\u201d after'
As @Esailija pointed out in the comments, the above doesn't interpret the data as utf-8. To force utf-8 decoding:
from lxml import html
html_data = b"""<meta http-equiv="content-type"
content="text/html; charset=UTF-8">
<p>before “\xe2\x80” after"""
doc = html.fromstring(html_data.decode('utf-8','ignore'))
print(repr(doc.find('.//p').text))
# -> u'before \u201c\u201d after'
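A small variation on the same idea (not in the original answer): decoding with errors='replace' instead of 'ignore' leaves a visible U+FFFD marker where the undecodable bytes were, which can make it easier to see where the source data is broken. The sample bytes here are made up for illustration:
from lxml import html

html_data = b"""<meta http-equiv="content-type"
    content="text/html; charset=UTF-8">
    <p>before \xe2\x80\x9c\xe2\x80\xe2\x80\x9d after"""

# 'replace' substitutes U+FFFD for undecodable bytes instead of dropping them
doc = html.fromstring(html_data.decode('utf-8', 'replace'))
print(repr(doc.find('.//p').text))
# the truncated sequence shows up as u'\ufffd' in the <p> text
# (the exact number of replacement characters can differ between Python versions)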
Upvotes: 1