Reputation: 2875
When parsing HTML, BeautifulSoup and PyQuery rely on an underlying parser such as lxml or html5lib. Let's say I have a file containing the following:
<span> é and ’ </span>
In my environment the characters seem to be incorrectly decoded when using PyQuery:
>>> from pyquery import PyQuery as pq
>>> doc = pq(filename=PATH, parser="xml")
>>> doc.text()
'é and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html")
>>> doc.text()
'Ã\x83© and ââ\x82¬â\x84¢'
>>> doc = pq(filename=PATH, parser="soup")
>>> doc.text()
'é and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html5")
>>> doc.text()
'é and â\u20ac\u2122'
Beyond the fact that the encoding seems incorrect, one of the main problems is that doc.text() returns an instance of str instead of bytes, which isn't normal according to a question I asked yesterday.
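As a side note, the sequence 'â\u20ac\u2122' in the outputs above is exactly what appears when the UTF-8 bytes of '’' (U+2019) are decoded with a Windows code page. The sketch below only illustrates that round trip with the standard library; cp1252 is an assumption about the code page involved, not something confirmed for this file.

```python
# Illustration only: how U+2019 (’) can turn into 'â\u20ac\u2122'.
original = "\u2019"

# U+2019 is three bytes in UTF-8: E2 80 99
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with cp1252 (assumed code page) yields three characters
mojibake = utf8_bytes.decode("cp1252")
print(ascii(mojibake))

# Reversing the wrong step recovers the original character
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == original)
```

If the same pattern shows up in parsed text, it suggests that UTF-8 bytes were decoded with the wrong codec somewhere before or during parsing.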
Also, passing the argument encoding='utf-8' to PyQuery seems to have no effect; I also tried 'latin1' and nothing changed. I also tried adding some metadata, because I read that lxml reads it to figure out which encoding to use, but it doesn't change anything:
<!DOCTYPE html>
<html lang="fr" dir="ltr">
<head>
<meta http-equiv="content-type" content="text/html;charset=latin1"/>
<span> é and ’ </span>
</head>
</html>
If I use lxml directly, the result is a bit different:
>>> from lxml import etree
>>> tree = etree.parse(PATH)
>>> tree.docinfo.encoding
'UTF-8'
>>> result = etree.tostring(tree.getroot(), pretty_print=False)
>>> result
b'<span> &#233; and &#8217; </span>'
>>> import html
>>> html.unescape(result.decode('utf-8'))
'<span> é and \u2019 </span>\n'
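The html.unescape step above works because non-ASCII characters can be written as numeric character references in ASCII-only output. Here is a minimal sketch of that conversion using only the standard library; the bytes literal is an assumed sample of such output, not something copied from a real lxml run.

```python
import html

# Assumed sample: ASCII-only markup using numeric character references
serialized = b"<span> &#233; and &#8217; </span>"

# Decode the bytes first, then turn the references back into characters
text = html.unescape(serialized.decode("ascii"))
print(ascii(text))
```

If I read the lxml documentation correctly, passing encoding='unicode' to etree.tostring returns a str directly, which would make the unescape step unnecessary.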
Argh, this is driving me a bit crazy; your help would be appreciated.
Upvotes: 0
Views: 344
Reputation: 2875
I think I figured it out. It seems that, even though BeautifulSoup and PyQuery allow it, it is a bad idea to let them open a file containing special UTF-8 characters directly. In particular, what confused me the most is the '’' symbol, which does not seem to be handled correctly by my Windows terminal. So, the solution is to pre-process the file before parsing it:
from pyquery import PyQuery as pq


def pre_process_html_content(html_content, encoding=None):
    """Pre-process bytes coming from a file or a request."""
    if not isinstance(html_content, bytes):
        raise TypeError("html_content must be bytes, not " + str(type(html_content)))
    # Fall back to UTF-8: bytes.decode(None) would raise a TypeError
    html_content = html_content.decode(encoding or 'utf-8')
    # Handle weird symbols here
    html_content = html_content.replace('\u2019', "'")
    return html_content


def sanitize_html_file(path, encoding=None):
    """Read the file as raw bytes and return the cleaned, decoded text."""
    with open(path, 'rb') as f:
        content = f.read()
    encoding = encoding or 'utf-8'
    return pre_process_html_content(content, encoding)


def open_pq(path, parser=None, encoding=None):
    """Helper to open an HTML file with PyQuery."""
    content = sanitize_html_file(path, encoding)
    parser = parser or 'xml'
    return pq(content, parser=parser)


doc = open_pq(PATH)
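To see why reading the bytes and decoding them explicitly helps, here is a small standard-library sketch that does not depend on PyQuery. The helper name read_html_text and the temporary sample file are hypothetical, and cp1252 stands in for a typical Windows default encoding, which is an assumption on my side.

```python
import os
import tempfile


def read_html_text(path, encoding="utf-8"):
    """Read raw bytes and decode them explicitly (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.html")
    # Write the sample markup as UTF-8 bytes
    with open(sample_path, "wb") as f:
        f.write("<span> \u00e9 and \u2019 </span>".encode("utf-8"))

    # Explicit UTF-8 decoding gives back the intended characters
    decoded = read_html_text(sample_path)
    # Decoding the same bytes as cp1252 (assumed Windows default) garbles them
    wrong = read_html_text(sample_path, encoding="cp1252")

print(ascii(decoded))
print(ascii(wrong))
```

Decoding up front means the parser receives a str, so it no longer has to guess the encoding of the file.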
Upvotes: 1