Encoding error while fetching HTML

Question

On Python 3.2 I get the following error when trying to fetch HTML from a remote site. The same code works well on Python 2.7.

[Image: screenshot of the error]

Code:

from random import randint
from time import sleep

import requests


def connectAmazon():
    # sleep() takes seconds, so usleep(x) pauses for x microseconds.
    usleep = lambda x: sleep(x / 1000000.0)
    shouldRetry = True
    retries = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36'}
    attempt = 0
    while shouldRetry:
        jitter = randint(2, 9)  # random factor so retries are not evenly spaced
        attempt += 1
        print ("Attempt#", attempt)
        url = "http://www.amazon.com/gp/offer-listing/B009OZUPUC/sr=/qid=/ref=olp_prime_new?ie=UTF8&colid=&coliid=&condition=new&me=&qid=&seller=&shipPromoFilter=1&sort=sip&sr"
        # Send the browser-like User-Agent defined above with the request.
        html = requests.get(url, headers=headers)
        status = html.status_code
        if status == 200:
            shouldRetry = False
            print ("Success. Check HTML Below")
            print(html.text) #The Buggy Line
            break
        elif status == 503:
            retries += 1
            # Back off: the delay grows with retries**4.
            delay = jitter * (pow(retries, 4)*100)
            print ("Delay(ms) = ", delay)
            usleep(delay * 1000)  # delay is in milliseconds, usleep expects microseconds
        else:
            # Any other status is not retried, to avoid looping without a pause.
            print ("Unexpected status:", status)
            break


connectAmazon()

What should be done to resolve this on Python 3.2 (or Python 3.x in general)?

Paulo Bu · Accepted Answer

OK, the Windows command line is very problematic with encodings*. The error occurs because, when writing output, print encodes html.text into the cmd encoding (you can find out which one it is by running the chcp command). There is probably at least one character in html.text that can't be encoded in cmd's encoding.
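To see the failure without touching the network, here is a minimal sketch. It assumes the console uses code page 437 (yours may differ, check with chcp), and the sample string and the can_encode helper are made up for illustration:

```python
import sys


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical page text containing a euro sign (U+20AC), which cp437 lacks.
sample = "Price: 10\u20ac"

# Only ASCII is printed here, so these lines are safe on any console.
print("stdout encoding:", sys.stdout.encoding)
print("cp437 can encode sample:", can_encode(sample, "cp437"))
print("utf-8 can encode sample:", can_encode(sample, "utf-8"))
```

print(html.text) performs the same kind of encode step with the console's encoding, which is why a single unsupported character is enough to raise the error.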

My solution for Python 3 would be to force an output encoding. Sadly, in Python 3 this is a little more awkward than I would like. You'll need to replace the line print(html.text) with:

import sys
sys.stdout.buffer.write(html.text.encode('utf8'))
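If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: write_utf8 is a name I made up, and the trailing newline and the flush calls are my additions (sys.stdout.buffer.write does not add a newline, and flushing keeps the bytes in order with earlier print output). The demo writes to an in-memory stream so it runs the same on any console:

```python
import io
import sys


def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes to a binary stream (stdout's buffer by default)."""
    if stream is None:
        sys.stdout.flush()  # keep ordering with anything print() wrote earlier
        stream = sys.stdout.buffer
    data = (text + "\n").encode("utf-8")
    stream.write(data)
    stream.flush()
    return len(data)


# Demonstration against an in-memory binary stream instead of the real console.
demo = io.BytesIO()
write_utf8("caf\u00e9 10\u20ac", demo)
print(demo.getvalue())
```

In the question's code you would then call write_utf8(html.text) in place of print(html.text).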

Of course, that line won't work in Python 2. In Python 2 you can just encode your output before printing it, so print(html.text) can be replaced with:

print html.text.encode('utf8')

Important note: In Python 2, print is a statement, not a function. Calling print('hi') works because print simply prints the expression inside the parentheses. When you do print('hi', 2), that expression is a tuple, so you get ('hi', 2) printed. That's not exactly what you want. It works by miracle :D
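One way to sidestep the tuple issue, sketched here with a made-up helper name (attempt_message), is to build the whole line as a single string before printing, so print receives one argument under either version:

```python
def attempt_message(attempt):
    """Build the whole line first so print receives a single string."""
    return "Attempt# %d" % attempt


# What Python 2's print statement shows for print("Attempt#", 1): the tuple's repr.
print(repr(("Attempt#", 1)))

# A single pre-formatted string prints the same under either version.
print(attempt_message(1))
```

The first line of output is the tuple form, the second is the plain message you most likely wanted.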

Hope this helps!

* This is due to cmd's lack of UTF-8 support. It has a weird 65001 code page which is not entirely the same as UTF-8, and Python 3.2 does not work with it.
