Getting proper UTF-8 from lxml.html.fromstring via requests.get from HTML page?

Question

Here is the MWE, test.py. The test webpage, written inline as mypage, is served from http://sdaaubckp.sourceforge.net/test/test-utf8.html, so you should be able to run this script as-is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os, sys
import re
import lxml.html as LH
import requests
if sys.version_info[0]<3: # python 2
  from StringIO import StringIO
else: #python 3
  from io import StringIO


# this page uploaded on: http://sdaaubckp.sourceforge.net/test/test-utf8.html
mypage = """





  
  
  My Page
  
  



  Testing: tøst



"""

url_page = "http://sdaaubckp.sourceforge.net/test/test-utf8.html"

confpage = requests.get(url_page)
print(confpage.encoding) # it detects ISO-8859-1, even though the page declares UTF-8
confpage.encoding = "UTF-8"
print(confpage.encoding) # now it says UTF-8, but...
#print(confpage.content)
if sys.version_info[0]<3: # python 2
  mystr = confpage.content
else: #python 3
  mystr = confpage.content.decode("utf-8")
for line in iter(mystr.splitlines()):
  if 'Testing' in line:
    print(line)
confpagetree = LH.fromstring(confpage.content)
print(confpagetree)
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
  if 'Testing' in line:
    print(line)

I'm running this on Ubuntu 14.04.5 LTS; both Python 2 and 3 give the same results with this script:

$ python2 test.py 
ISO-8859-1
UTF-8
  Testing: tøst

  Testing: tÃ¸st

$ python3 test.py 
ISO-8859-1
UTF-8
  Testing: tøst

  Testing: tÃ¸st

Note how:

In both cases, confpage.encoding detects ISO-8859-1, even though the webpage declares UTF-8
In both cases, correct UTF-8 character ø is printed from confpage.content
In both cases, corrupt UTF-8 representation Ã¸ is output from lxml.html.fromstring(confpage.content).text_content()

My suspicion is that, since the webpage uses the UTF-8 character – (Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]) before it declares its encoding in the head, this somehow borks requests and/or lxml.html.fromstring().text_content(), which results in the corrupt representation.

My question is: what can I do to get a correct UTF-8 character at the output of lxml.html.fromstring().text_content(), ideally for both Python 2 and 3?

abarnert · Accepted Answer

The root problem is that you're using confpage.content instead of confpage.text.

requests.Response.content gives you the raw bytes (bytes in 3.x, str in 2.x), as pulled off the wire. It doesn't matter what encoding is set to, because content never uses it.
requests.Response.text gives you the decoded Unicode (str in 3.x, unicode in 2.x), based on the encoding.
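To make the difference concrete without touching the network, here is a minimal sketch that imitates the two attributes with plain byte strings. The variable names (raw, as_latin1, as_utf8) are made up for illustration, and the decode calls only approximate what text does internally:

```python
# -*- coding: utf-8 -*-
# Stand-in for confpage.content: the UTF-8 bytes as they come off the wire.
raw = u"Testing: tøst".encode("utf-8")

# Roughly what confpage.text yields while confpage.encoding is "ISO-8859-1".
as_latin1 = raw.decode("iso-8859-1")

# Roughly what confpage.text yields after setting confpage.encoding = "UTF-8".
as_utf8 = raw.decode("utf-8")

print(repr(raw))
print(repr(as_latin1))
print(repr(as_utf8))
```

The bytes never change; only the codec used to interpret them does. Decoding as ISO-8859-1 turns the two bytes of ø into the two characters Ã¸, which is the corrupt form shown in the question.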

So, setting the encoding but then using content doesn't do anything. If you just change the rest of your code to use text instead of content (and get rid of the now-spurious decode for Python 3), it will work:

mystr = confpage.text
for line in iter(mystr.splitlines()):
  if 'Testing' in line:
    print(line)
confpagetree = LH.fromstring(confpage.text)
print(confpagetree)
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
  if 'Testing' in line:
    print(line)

If you want to go through the exact problem with each of your examples:

Your first example is right in Python 3, but not the best way to do it. By calling decode("utf-8") on the content, since the bytes do happen to be UTF-8, you're decoding them properly. So they will print out properly.
Your first example is wrong in Python 2. You're just printing the content, which is a bunch of UTF-8 bytes. If your console is UTF-8 (as it is on macOS, and might be on Linux), this will happen to work. If your console is something else, like cp1252 or Latin-1 (as it is on Windows, and might be on Linux), this will give you mojibake.
Your second example is also wrong. By passing bytes to LH.fromstring, you're forcing lxml to guess what encoding to use, and it guesses Latin-1, so you get mojibake.
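If you would rather keep passing bytes to lxml, another option is to tell the parser the encoding explicitly instead of letting it guess. The sketch below assumes a small hypothetical page (not the real test page) and uses lxml.html.HTMLParser with its encoding argument; the names page_bytes, from_text and from_bytes are illustrative only:

```python
# -*- coding: utf-8 -*-
import lxml.html as LH

# Hypothetical stand-in for confpage.content: UTF-8 bytes, no charset hint.
page_bytes = (
    u"<html><head><title>My Page</title></head>"
    u"<body><p>Testing: tøst</p></body></html>"
).encode("utf-8")

# Option 1: decode first, then parse Unicode (what using confpage.text does).
from_text = LH.fromstring(page_bytes.decode("utf-8"))

# Option 2: keep the bytes, but give the parser the encoding up front.
utf8_parser = LH.HTMLParser(encoding="utf-8")
from_bytes = LH.fromstring(page_bytes, parser=utf8_parser)

for tree in (from_text, from_bytes):
    for line in tree.text_content().splitlines():
        if u"Testing" in line:
            print(line)
```

Either way, lxml no longer has to guess, so text_content() should come back with ø intact. With the real response, that would be LH.fromstring(confpage.content, parser=utf8_parser).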
