sdaau

Reputation: 38629

Getting proper UTF-8 from lxml.html.fromstring via requests.get from HTML page?

Here is the MWE, test.py. The test web page, reproduced inline as mypage, is served from http://sdaaubckp.sourceforge.net/test/test-utf8.html , so you should be able to run this script as-is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os, sys
import re
import lxml.html as LH
import requests
if sys.version_info[0]<3: # python 2
  from StringIO import StringIO
else: #python 3
  from io import StringIO


# this page uploaded on: http://sdaaubckp.sourceforge.net/test/test-utf8.html
mypage = """
<!doctype html>
<html lang="en">

<head>

  <!-- Basic Page Needs
  –––––––––––––––––––––––––––––––––––––––––––––––––– -->
  <meta charset="utf-8">
  <title>My Page</title>
  <meta name="description" content="">
  <meta name="author" content="">
</head>

<body>
  <div>Testing: tøst</div>
</body>

</html>
"""

url_page = "http://sdaaubckp.sourceforge.net/test/test-utf8.html"

confpage = requests.get(url_page)
print(confpage.encoding) # it detects ISO-8859-1, even if the page declares <meta charset="utf-8">?
confpage.encoding = "UTF-8"
print(confpage.encoding) # now it says UTF-8, but...
#print(confpage.content)
if sys.version_info[0]<3: # python 2
  mystr = confpage.content
else: #python 3
  mystr = confpage.content.decode("utf-8")
for line in iter(mystr.splitlines()):
  if 'Testing' in line:
    print(line)
confpagetree = LH.fromstring(confpage.content)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
  if 'Testing' in line:
    print(line)

I'm running this on Ubuntu 14.04.5 LTS; both Python 2 and 3 give the same results with this script:

$ python2 test.py 
ISO-8859-1
UTF-8
  <div>Testing: tøst</div>
<Element html at 0x7fb5b9d12ec0>
  Testing: tøst

$ python3 test.py 
ISO-8859-1
UTF-8
  <div>Testing: tøst</div>
<Element html at 0x7f272fc53318>
  Testing: tøst

Note how:

My suspicion is that, since the web page uses a non-ASCII UTF-8 character (Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]) before it declares <meta charset="utf-8"> in the <head>, this somehow borks requests and/or lxml.html.fromstring().text_content(), which results in the corrupt representation.

My question is - what can I do, so I get a correct UTF-8 character at the output of lxml.html.fromstring().text_content() - hopefully for both Python 2 and 3?

Upvotes: 0

Views: 576

Answers (1)

abarnert

Reputation: 365767

The root problem is that you're using confpage.content instead of confpage.text.

  • requests.Response.content gives you the raw bytes (bytes in 3.x, str in 2.x), as pulled off the wire. It doesn't matter what the encoding attribute is set to, because content never uses it.
  • requests.Response.text gives you the decoded Unicode (str in 3.x, unicode in 2.x), based on the encoding.
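
To make the distinction concrete without touching the network, here is a minimal sketch. FakeResponse is a hypothetical stand-in written for this illustration, not the real requests.Response; it only assumes that text is content decoded with whatever encoding is currently set, as described in the two points above.

```python
# -*- coding: utf-8 -*-

class FakeResponse(object):
    """Hypothetical stand-in for requests.Response (illustration only)."""

    def __init__(self, content, encoding):
        self.content = content    # raw bytes; never affected by `encoding`
        self.encoding = encoding  # only consulted when building `text`

    @property
    def text(self):
        # Decode on every access, so changing `encoding` changes the result.
        return self.content.decode(self.encoding)


raw = u"Testing: t\u00f8st".encode("utf-8")  # UTF-8 bytes, as served by the page

resp = FakeResponse(raw, "ISO-8859-1")  # the encoding the question saw reported
wrong = resp.text                       # UTF-8 bytes decoded as Latin-1

resp.encoding = "UTF-8"                 # what the question's script does next
unchanged = resp.content                # still exactly the same bytes
right = resp.text                       # now decoded as UTF-8

print(repr(wrong))
print(repr(right))
```

Changing the encoding alters text but leaves content untouched, which is why the original script saw no effect from its assignment.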

So, setting the encoding but then using content doesn't do anything. If you just change the rest of your code to use text instead of content (and get rid of the now-spurious decode for Python 3), it will work:

mystr = confpage.text
for line in iter(mystr.splitlines()):
  if 'Testing' in line:
    print(line)
confpagetree = LH.fromstring(confpage.text)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
  if 'Testing' in line:
    print(line)

If you want to go through the exact problem with each of your examples:

  • Your first example is right in Python 3, but not the best way to do it. By calling decode("utf-8") on the content, since the bytes do happen to be UTF-8, you're decoding them properly. So they will print out properly.
  • Your first example is wrong in Python 2. You're just printing the content, which is a bunch of UTF-8 bytes. If your console is UTF-8 (as it is on macOS, and might be on Linux), this will happen to work. If your console is something else, like cp1252 or Latin-1 (as it is on Windows, and might be on Linux), this will give you mojibake.
  • Your second example is also wrong. By passing bytes to LH.fromstring, you're forcing lxml to guess what encoding to use, and it guesses Latin-1, so you get mojibake.
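
If you want to check that a garbled string really is UTF-8 that was decoded as Latin-1, one rough diagnostic is to reverse the wrong step and see whether the result is valid UTF-8. The helper name undo_latin1_misdecoding below is made up for this sketch; it is a debugging aid under that assumption, not a substitute for decoding correctly in the first place.

```python
# -*- coding: utf-8 -*-

def undo_latin1_misdecoding(garbled):
    """Return the repaired text if `garbled` looks like UTF-8 bytes that
    were decoded as Latin-1; return None if that explanation does not fit.

    Hypothetical helper for diagnosis only.
    """
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes.
        original_bytes = garbled.encode("latin-1")
        # If those bytes are valid UTF-8, the mojibake explanation holds.
        return original_bytes.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


mojibake = u"Testing: t\u00c3\u00b8st"   # what a Latin-1 guess produces
repaired = undo_latin1_misdecoding(mojibake)
already_fine = undo_latin1_misdecoding(u"Testing: t\u00f8st")

print(repr(repaired))
print(repr(already_fine))
```

A None result for text that was already correct is expected here: a lone 0xF8 byte is not valid UTF-8, so the round trip fails instead of silently corrupting good text.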

Upvotes: 2
