David542

Reputation: 110592

lxml not parsing unicode properly for HTML

I am trying to parse HTML, but unfortunately lxml is not allowing me to grab the actual text:

node = lxml.html.fromstring(r.content)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

# @@#### DÃ©mineurs

What do I need to do to correctly parse this text? Here is the web page: https://play.google.com/store/movies/details/D%C3%A9mineurs?id=KChu8wf5eVo&hl=fr and the text should be Démineurs.

Upvotes: 6

Views: 4683

Answers (2)

jedwards

Reputation: 30260

It's just an encoding issue.

It looks like you're using requests, which is good, because it does this work for you.

requests guesses the encoding from the HTTP response headers and exposes that guess as r.encoding. For that page, it guessed utf-8.

You could do:

data = r.content.decode('UTF-8')
# or
data = r.content.decode(r.encoding)
# then
node = lxml.html.fromstring(data)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

which works:

@@#### Démineurs

Better yet, just use r.text, which holds the response body already decoded with that encoding.

node = lxml.html.fromstring(r.text)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']

This also works:

@@#### Démineurs
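
Here is a minimal sketch of the decoding step, written for Python 3 and using only the standard library. The helper name decode_body and its utf-8 fallback are my own assumptions for illustration, not part of requests, and raw is just a stand-in for r.content. The fallback is there because r.encoding can be None when the server sends no charset, and decoding with None raises a TypeError.

```python
from typing import Optional


def decode_body(content: bytes, encoding: Optional[str], fallback: str = "utf-8") -> str:
    """Decode raw response bytes, using `fallback` when no encoding is known.

    Hypothetical helper: `encoding` plays the role of r.encoding.
    """
    return content.decode(encoding or fallback)


# Stand-in for r.content: the title "Démineurs" as UTF-8 bytes.
# "\u00e9" is the escape for the single character é.
raw = "D\u00e9mineurs".encode("utf-8")

# Encoding reported by the server (what r.encoding would hold).
title = decode_body(raw, "utf-8")

# No encoding reported: fall back to the assumed default instead of failing.
title_no_header = decode_body(raw, None)

print(title == title_no_header)
```

The decoded string is what you would hand to lxml.html.fromstring, the same way r.text is used above.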

Upvotes: 5

Ignacio Vazquez-Abrams

Reputation: 799580

The document has no encoding information, so you need to create a parser that uses the correct encoding by default.

>>> import lxml.html, lxml.etree
>>> lxml.html.fromstring('<p>é</p>').text
u'\xc3\xa9'
>>> hp = lxml.etree.HTMLParser(encoding='utf-8')
>>> lxml.html.fromstring('<p>é</p>', parser=hp).text
u'\xe9'
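
To read the two results above: u'\xc3\xa9' is two characters and u'\xe9' is one. The sketch below (Python 3, standard library only) reproduces both without lxml, on the assumption that the undeclared bytes were read as Latin-1, which is consistent with the first output. The variable names are mine.

```python
# "\u00e9" is the single character é; in UTF-8 it takes two bytes.
raw = "\u00e9".encode("utf-8")       # b'\xc3\xa9'

# Reading those two bytes as Latin-1 yields two characters (U+00C3, U+00A9).
misread = raw.decode("latin-1")

# Reading them as UTF-8 yields the intended single character (U+00E9).
correct = raw.decode("utf-8")

code_points_misread = [hex(ord(c)) for c in misread]
code_points_correct = [hex(ord(c)) for c in correct]
print(code_points_misread, code_points_correct)

# A string misread this way can be recovered by reversing the wrong step.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == correct)
```

Repairing after the fact only works when the wrong codec is known, so telling the parser the encoding up front is the cleaner fix.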

Upvotes: 7
