Reputation: 38629
Here is the MWE, test.py. The test webpage, written inline as mypage, is served from http://sdaaubckp.sourceforge.net/test/test-utf8.html , so you should be able to run this script as-is:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import re
import lxml.html as LH
import requests
if sys.version_info[0]<3: # python 2
    from StringIO import StringIO
else: #python 3
    from io import StringIO
# this page uploaded on: http://sdaaubckp.sourceforge.net/test/test-utf8.html
mypage = """
<!doctype html>
<html lang="en">
<head>
<!-- Basic Page Needs
–––––––––––––––––––––––––––––––––––––––––––––––––– -->
<meta charset="utf-8">
<title>My Page</title>
<meta name="description" content="">
<meta name="author" content="">
</head>
<body>
<div>Testing: tøst</div>
</body>
</html>
"""
url_page = "http://sdaaubckp.sourceforge.net/test/test-utf8.html"
confpage = requests.get(url_page)
print(confpage.encoding) # it detects ISO-8859-1, even if the page declares <meta charset="utf-8">?
confpage.encoding = "UTF-8"
print(confpage.encoding) # now it says UTF-8, but...
#print(confpage.content)
if sys.version_info[0]<3: # python 2
    mystr = confpage.content
else: #python 3
    mystr = confpage.content.decode("utf-8")
for line in iter(mystr.splitlines()):
    if 'Testing' in line:
        print(line)
confpagetree = LH.fromstring(confpage.content)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
    if 'Testing' in line:
        print(line)
I'm running this on Ubuntu 14.04.5 LTS; both Python 2 and 3 give the same results with this script:
$ python2 test.py
ISO-8859-1
UTF-8
<div>Testing: tøst</div>
<Element html at 0x7fb5b9d12ec0>
Testing: tÃ¸st
$ python3 test.py
ISO-8859-1
UTF-8
<div>Testing: tøst</div>
<Element html at 0x7f272fc53318>
Testing: tÃ¸st
Note how:
confpage.encoding detects ISO-8859-1, even if the webpage declares <meta charset="utf-8">;
ø is printed from confpage.content;
Ã¸ is output from lxml.html.fromstring(confpage.content).text_content().
My suspicion is that, since the webpage uses the – UTF-8 character (Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]) before it declares <meta charset="utf-8"> in the <head>, this somehow borks requests and/or lxml.html.fromstring().text_content(), which results in the corrupt representation.
My question is: what can I do so that I get a correct UTF-8 character at the output of lxml.html.fromstring().text_content(), hopefully for both Python 2 and 3?
Upvotes: 0
Views: 576
Reputation: 365767
The root problem is that you're using confpage.content instead of confpage.text.
requests.Response.content gives you the raw bytes (bytes in 3.x, str in 2.x), as pulled off the wire. It doesn't matter what encoding is, because you aren't using it.
requests.Response.text gives you the decoded Unicode (str in 3.x, unicode in 2.x), based on the encoding.
So, setting the encoding but then using content doesn't do anything. If you just change the rest of your code to use text instead of content (and get rid of the now-spurious decode for Python 3), it will work:
mystr = confpage.text
for line in iter(mystr.splitlines()):
    if 'Testing' in line:
        print(line)
confpagetree = LH.fromstring(confpage.text)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
    if 'Testing' in line:
        print(line)
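To make the content/text distinction concrete, here is a minimal sketch that needs no network access. FakeResponse is a hypothetical stand-in written for this illustration, not requests' actual implementation; it only mimics the relationship between content, encoding, and text described above.

```python
class FakeResponse(object):
    """Hypothetical stand-in for requests.Response (illustration only)."""

    def __init__(self, content, encoding):
        self.content = content    # raw bytes; never affected by .encoding
        self.encoding = encoding  # only consulted when .text is read

    @property
    def text(self):
        # Decode on every access, so changing .encoding changes the result.
        return self.content.decode(self.encoding)


# The bytes a server would send for a UTF-8 page containing U+00F8.
raw = u"Testing: t\u00f8st".encode("utf-8")

resp = FakeResponse(raw, "ISO-8859-1")  # encoding as first reported in the question
wrong_text = resp.text                  # UTF-8 bytes misread as Latin-1

resp.encoding = "UTF-8"                 # the fix applied in the question
right_text = resp.text                  # now decoded correctly

content_unchanged = (resp.content == raw)
```

Changing encoding alters what text returns, while content stays the same bytes throughout, which is why setting the encoding and then reading content has no effect.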
If you want to go through the exact problem with each of your examples:
In Python 3, you call decode("utf-8") on the content. Since the bytes do happen to be UTF-8, you're decoding them properly, so they print out properly.
In Python 2, you print the content as-is, which is a bunch of UTF-8 bytes. If your console is UTF-8 (as it is on macOS, and might be on Linux), this will happen to work. If your console is something else, like cp1252 or Latin-1 (as it is on Windows, and might be on Linux), this will give you mojibake.
When you pass the content to LH.fromstring, you're forcing lxml to guess what encoding to use, and it guesses Latin-1, so you get mojibake.
Upvotes: 2
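On the detection part of the question: as far as I know, requests takes encoding from the HTTP Content-Type header rather than from the markup, and falls back to ISO-8859-1 for text responses whose header names no charset, so the <meta charset="utf-8"> tag is never consulted. If you would rather honor the page's own declaration than hard-code "UTF-8", a rough sketch follows. The helper declared_charset and its regular expression are hypothetical names made up for this example, not part of requests or lxml, and the 1024-byte window is an assumption modeled on how browsers prescan the start of a document.

```python
import re

# Hypothetical pattern: finds charset=... inside an early <meta> tag.
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)', re.I)


def declared_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag near the top of raw bytes."""
    match = META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


# Stand-in for confpage.content: UTF-8 bytes with an en dash before the <meta>.
page = (u'<!doctype html><html><head><!-- \u2013\u2013 -->'
        u'<meta charset="utf-8"></head>'
        u'<body><div>Testing: t\u00f8st</div></body></html>').encode("utf-8")

charset = declared_charset(page)
text = page.decode(charset)
```

With a real response you would assign the result to confpage.encoding before reading confpage.text; treat the regular expression as a simplification, since it does not handle every way a page can declare its encoding.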