user1561108

Reputation: 2747

What is this encoding and how do I convert it?

I am pulling text out of an HTML tag attribute using lxml and XPath via tag.attrib['title']. I get:

MÃ¡laga Airport

whereas in the browser, at the same URL, I see:

Málaga Airport

How do I convert the former to the latter?

Upvotes: 1

Views: 131

Answers (1)

ekhumoro

Reputation: 120628

It seems that the lxml HTML parser assumes a 'latin1' encoding for byte strings when no encoding is specified.

So unless the input is encoded as 'latin1' (or 'ascii'), the encoding needs to be specified explicitly. In this case, it looks like it should be 'utf-8':

>>> from lxml import etree
>>>
>>> html = u"""
... <html>
... <head><title>Test</title></head>
... <body>
... <p test="Málaga">Example</p>
... </body>
... </html>
... """
>>>
>>> html = html.encode('utf-8')
>>>
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
>>> print(tree.xpath('//p/@test')[0])
MÃ¡laga
>>>
>>> parser = etree.HTMLParser(encoding='utf-8')
>>> tree = etree.fromstring(html, parser)
>>> print(tree.xpath('//p/@test')[0])
Málaga
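
If re-parsing isn't an option and you already have the garbled string, it can usually be repaired after the fact by reversing the wrong decode. The following is a minimal Python 3 sketch, assuming the text really was UTF-8 that got decoded as 'latin1'. The helper name fix_mojibake is made up for illustration and is not part of lxml:

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper: it only handles this one kind of mix-up.
    """
    try:
        # Re-encode to recover the original bytes, then decode them as UTF-8.
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not look like Latin-1-decoded UTF-8; leave it alone.
        return text


# Reproduce the problem: UTF-8 bytes read as Latin-1.
garbled = "Málaga Airport".encode("utf-8").decode("latin1")
print(garbled)
print(fix_mojibake(garbled))
```

Passing the right encoding to the parser is still the better fix, since this round trip only works when every character in the garbled string maps cleanly back to a 'latin1' byte.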

Upvotes: 2
