magneto

Reputation: 113

Python 3 - HTTP proxy issue

I'm using Python 3.3.0 on Windows 7.

I made this script to bypass an HTTP proxy without authentication on a system. But when I execute it, it gives this error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 6242-6243: character maps to <undefined>. It seems that it fails to decode Unicode characters into a string.

So, what should I use or change? Does anybody have a clue or a solution?

My .py file contains the following:

import sys, urllib
import urllib.request

url = "http://www.python.org"
proxies = {'http': 'http://199.91.174.6:3128/'}

opener = urllib.request.FancyURLopener(proxies)

try:
    f = urllib.request.urlopen(url)
except urllib.error.HTTPError as  e:
    print ("[!] The connection could not be established.")
    print ("[!] Error code: ",  e.code)
    sys.exit(1)
except urllib.error.URLError as  e:
    print ("[!] The connection could not be established.")
    print ("[!] Reason: ",  e.reason)
    sys.exit(1)

source = f.read()

if "iso-8859-1" in str(source):
    source = source.decode('iso-8859-1')
else:
    source = source.decode('utf-8')

print("\n SOURCE:\n",source)

Upvotes: 0

Views: 2559

Answers (1)

t-8ch

Reputation: 2713

  1. This code doesn't even use your proxy: the opener is created but never used, and the request is made with urllib.request.urlopen() instead.
  2. This form of encoding detection is really weak. You should only look for the declared encoding in the well-defined locations: the HTTP header 'Content-Type' and, if the response is HTML, the charset meta tag.
  3. As you didn't include a stack trace, I assumed the error happened in the line if "iso-8859-1" in str(source):. Note, however, that in Python 3 str() on a bytes object does not decode anything; it returns the repr (b'...'), so that line cannot raise this error. A UnicodeEncodeError from the 'charmap' codec is more likely raised by the final print(), which has to encode the text for the Windows console. If you really want to keep this check (see point 2), you should use if b"iso-8859-1" in source: instead. This works on bytes instead of strings, so no decoding has to be done beforehand.
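Points 1 and 2 could be sketched roughly as follows. This is only a minimal sketch that stays with urllib: the helper names (make_proxy_opener, charset_from_content_type, fetch_text) and the UTF-8 fallback are my own assumptions, not something from the question.

```python
import urllib.request
from email.message import Message


def make_proxy_opener(proxies):
    """Build an opener whose requests go through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def fetch_text(opener, url, fallback="utf-8"):
    """Fetch url through the opener and decode it with the declared charset."""
    with opener.open(url) as resp:
        body = resp.read()
        charset = charset_from_content_type(resp.headers.get("Content-Type"), fallback)
    # errors="replace" is an assumption: it trades exactness for never raising here.
    return body.decode(charset, errors="replace")
```

With the proxy from the question this would be called as fetch_text(make_proxy_opener({'http': 'http://199.91.174.6:3128/'}), "http://www.python.org"). The important part is that the request goes through opener.open() rather than urllib.request.urlopen().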

Note: This code works fine for me, presumably because my system uses a default encoding of UTF-8 while your Windows system uses something different.

Update: I recommend using python-requests when doing HTTP in Python.
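The difference between systems usually shows up when print() has to encode text for a console whose code page lacks some characters. If that is what happens here, a workaround might look like the sketch below. The helper name console_safe is hypothetical, and replacing unencodable characters with "?" is just one possible policy.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to UTF-8 when stdout has no encoding (e.g. when it is redirected oddly).
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

The last line of the script would then become print("\n SOURCE:\n", console_safe(source)).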

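import requests

# Proxy taken from the question; substitute your own.
proxies = {'http': 'http://199.91.174.6:3128/'}

with requests.Session() as sess:
    # Current versions of requests take no Session() arguments; set proxies on the instance.
    sess.proxies.update(proxies)
    r = sess.get('http://httpbin.org/ip')
    print(r.apparent_encoding)
    print(r.text)
    # more requests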

Note: this doesn't use the encoding specified in the HTML; you would need an HTML parser like BeautifulSoup to extract that.
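To give an idea of what extracting the declared encoding involves, here is a rough sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup. The class and function names are made up for illustration, and looking only at the first 4096 bytes is an assumption.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="...">
            self.charset = attrs["charset"].lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset" and value:
                    self.charset = value.strip().lower()


def meta_charset(html_bytes):
    """Return the charset declared in the HTML head, or None."""
    parser = MetaCharsetParser()
    # latin-1 maps every byte to a character, so this pre-decode never fails.
    parser.feed(html_bytes[:4096].decode("latin-1"))
    return parser.charset
```

With requests, the result could be applied before reading r.text, for example r.encoding = meta_charset(r.content) or r.encoding.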

Upvotes: 2
