Python3 html and lxml parser encoding problem

Question

When parsing some HTML using BeautifulSoup or PyQuery, they will use a parser like lxml or html5lib. Let's say I've a file containing the following

  é    and    ’

In my environnement they seems incorrectly encoded, using PyQuery:

>>> doc = pq(filename=PATH, parser="xml")
>>> doc.text()
'Ã© and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html")
>>> doc.text()
'Ã\x83Â© and Ã¢â\x82¬â\x84¢'
>>> doc = pq(filename=PATH, parser="soup")
>>> doc.text()
'Ã© and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html5")
>>> doc.text()
'Ã© and â\u20ac\u2122'

Beyond the fact that the encoding seems incorrect, one of the main problem is that doc.text() returns an instance of str instead of bytes which isn't a normal thing according to that question I asked yesterday.

Also, passing the argument encoding='utf-8' to PyQuery seems useless, I tried 'latin1' nothing change. I also tried to add some meta data because I read that lxml read them to figure out what encoding to use but it doesn't change anything:





  é    and    ’

If I use lxml directly it seems a bit different

>>> from lxml import etree
>>> tree = etree.parse(PATH)
>>> tree.docinfo.encoding
'UTF-8'

>>> result = etree.tostring(tree.getroot(), pretty_print=False)
>>> result
b'  é    and    ’  '

>>> import html
>>> html.unescape(result.decode('utf-8'))
'  é    and    \u2019  
'

Erf, It drives me a bit crazy, your help would be appreciated

Python3 html and lxml parser encoding problem

Answers (1)

Related Questions