Justin808

Reputation: 21522

UTF-8 in python issues

Meh, I'm not a fan of UTF-8 in Python; I can't seem to figure out how to solve this. As you can see, I'm already trying to base64-encode the value, but it looks like Python is trying to convert it from UTF-8 to ASCII first...

In general, I'm trying to POST form data that contains UTF-8 characters with urllib2. I guess it's the same as "How to send utf-8 content in a urllib2 request?", though there is no valid answer on that one. I'm trying to send only a byte string by base64-encoding it.

Traceback (most recent call last):
  File "load.py", line 165, in <module>
    main()
  File "load.py", line 17, in main
    beers()
  File "load.py", line 157, in beers
    resp = send_post("http://localhost:9000/beers", beer)
  File "load.py", line 64, in send_post
    connection.request ('POST', req.get_selector(), *encode_multipart_data (data, files))
  File "load.py", line 49, in encode_multipart_data
    lines.extend (encode_field (name))
  File "load.py", line 34, in encode_field
    '', base64.b64encode(u"%s" % data[field_name]))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/base64.py", line 53, in b64encode
    encoded = binascii.b2a_base64(s)[:-1]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

Code:

def random_string (length):
    return ''.join (random.choice (string.ascii_letters) for ii in range (length + 1))


def encode_multipart_data (data, files):
    boundary = random_string (30)

    def get_content_type (filename):
      return mimetypes.guess_type (filename)[0] or 'application/octet-stream'

    def encode_field (field_name):
      return ('--' + boundary,
              'Content-Disposition: form-data; name="%s"' % field_name,
              'Content-Transfer-Encoding: base64',
              '', base64.b64encode(u"%s" % data[field_name]))

    def encode_file (field_name):
      filename = files [field_name]
      file_size = os.stat(filename).st_size
      file_data = open(filename, 'rb').read()
      file_b64 = base64.b64encode(file_data)
      return ('--' + boundary,
              'Content-Disposition: form-data; name="%s"; filename="%s"' % (field_name, filename),
              'Content-Type: %s' % get_content_type(filename),
              'Content-Transfer-Encoding: base64',
              '', file_b64)

    lines = []
    for name in data:
      lines.extend (encode_field (name))
    for name in files:
      lines.extend (encode_file (name))
    lines.extend (('--%s--' % boundary, ''))
    body = '\r\n'.join (lines)

    headers = {'content-type': 'multipart/form-data; boundary=' + boundary,
               'content-length': str(len(body))}

    return body, headers


def send_post (url, data, files={}):
    req = urllib2.Request (url)
    connection = httplib.HTTPConnection (req.get_host())
    connection.request ('POST', req.get_selector(), *encode_multipart_data (data, files))
    return connection.getresponse()

The beer object's JSON (this is the data passed into encode_multipart_data):

    {
    "name"        : "Yuengling Oktoberfest",
    "brewer"      : "Yuengling Brewery",
    "description" : "America’s Oldest Brewery is proud to offer Yuengling Oktoberfest Beer. Copper in color, this medium bodied beer is the perfect blend of roasted malts with just the right amount of hops to capture a true representation of the style. Enjoy a Yuengling Oktoberfest Beer in celebration of the season, while supplies last!",
    "abv"         : 5.2, 
    "ibu"         : 26, 
    "type"        : "Lager",
    "subtype"     : "",
    "color"       : "",
    "seasonal"    : true,
    "servingTemp" : "Cold",
    "rating"      : 3,
    "inProduction": true  
    }

Upvotes: 0

Views: 2288

Answers (1)

Mark Tolonen

Reputation: 178264

You can't base64-encode Unicode, only byte strings. In Python 2.7, giving a Unicode string to a function that requires a byte string causes an implicit conversion to a byte string using the ascii codec, resulting in the error you see:

>>> import base64
>>> base64.b64encode(u'America\u2019s')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\base64.py", line 53, in b64encode
    encoded = binascii.b2a_base64(s)[:-1]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

So encode it to a byte string using a valid encoding first:

>>> base64.b64encode(u'America\u2019s'.encode('utf8'))
'QW1lcmljYeKAmXM='
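
To apply this to the question's code, one option is a small helper that converts any field value to bytes before base64-encoding it. This is only a sketch: the name `b64_field` and the choice of UTF-8 as the default encoding are assumptions, not part of the original code, and the receiving server has to decode the field with the same encoding.

```python
import base64


def b64_field(value, encoding='utf-8'):
    """Base64-encode a form field value (hypothetical helper).

    Byte strings are passed through unchanged; anything else
    (text, numbers, booleans) is converted to text and then
    encoded to bytes with `encoding` (assumed UTF-8 here).
    """
    if not isinstance(value, bytes):
        # u"%s" yields text for any value, including numbers.
        value = (u"%s" % (value,)).encode(encoding)
    return base64.b64encode(value)
```

In the question's `encode_field`, the last tuple element would then become `b64_field(data[field_name])` instead of `base64.b64encode(u"%s" % data[field_name])`.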

Upvotes: 4
