Reputation: 110592
I am trying to parse HTML, but unfortunately lxml
is not allowing me to grab the actual text:
node = lxml.html.fromstring(r.content)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']
# @@#### DÃ©mineurs
What do I need to do to correctly parse this text? Here is the web page: https://play.google.com/store/movies/details/D%C3%A9mineurs?id=KChu8wf5eVo&hl=fr and the text should be Démineurs.
Upvotes: 6
Views: 4683
Reputation: 30260
It's just an encoding issue.
It looks like you're using requests, which is good, because it does this work for you.
First, requests guesses at the encoding (based on the response headers), which you can access with r.encoding. For that page, requests guessed utf-8.
You could do:
data = r.content.decode('UTF-8')
# or
data = r.content.decode(r.encoding)
# then
node = lxml.html.fromstring(data)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']
which works:
@@#### Démineurs
But better yet, just use the r.text attribute, which has the output already decoded correctly.
node = lxml.html.fromstring(r.text)
self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text
print '@@####', self.fingerprint['Title']
This also works:
@@#### Démineurs
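To see why the decode step matters, here is a minimal sketch using only the standard library. It is written for Python 3 (the snippets above are Python 2), and the variable names are just for illustration. It shows the mojibake you get when UTF-8 bytes are read as Latin-1, and that decoding with the right codec recovers the title.

```python
# The UTF-8 encoding of "é" is the two bytes 0xC3 0xA9.
raw = 'Démineurs'.encode('utf-8')

# Decoding those bytes as Latin-1 maps each byte to its own character,
# so the single "é" turns into the two characters "Ã©".
garbled = raw.decode('latin-1')

# Decoding with the encoding the bytes were written in recovers the text.
correct = raw.decode('utf-8')

print(repr(garbled))  # 'DÃ©mineurs'
print(repr(correct))  # 'Démineurs'
```

Handing lxml an already-decoded string, whether from r.text or from your own decode call, means the parser never has to guess the encoding.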
Upvotes: 5
Reputation: 799580
The document has no encoding information, so you need to create a parser that uses the correct encoding by default.
>>> lxml.html.fromstring('<p>é</p>').text
u'\xc3\xa9'
>>> hp = lxml.etree.HTMLParser(encoding='utf-8')
>>> lxml.html.fromstring('<p>é</p>', parser=hp).text
u'\xe9'
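For anyone on Python 3, the same idea might look roughly like the sketch below. It assumes lxml is installed and that you are passing bytes (such as r.content) rather than a str. The sample markup and the variable names are made up for the example.

```python
import lxml.etree
import lxml.html

# UTF-8 bytes with no charset declaration in the markup.
raw = '<p>Démineurs</p>'.encode('utf-8')

# Tell the parser which encoding the bytes use instead of letting it guess.
utf8_parser = lxml.etree.HTMLParser(encoding='utf-8')
title = lxml.html.fromstring(raw, parser=utf8_parser).text

print(title)  # Démineurs
```

The explicit parser only matters for byte input. If you pass a str, it is already decoded and the encoding argument is not what determines the result.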
Upvotes: 7