Edward Yu

Reputation: 418

UnicodeEncodeError with JSON data

I have a JSON object that contains UTF-8 characters. When I try to print the object to the console (on Windows 8.1), it throws this error: UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 3706: character maps to <undefined>, because the console cannot display some of those characters. I checked this answer, but none of its solutions work, because a JSON object cannot be encoded and decoded. How can I fix encoding issues for JSON?

import json
import urllib.parse
import urllib.request
import zlib

def getTweets(self, company):
    # build the query string
    baseUrl = 'https://api.twitter.com/1.1/search/tweets.json'
    values = {'q' : company, 'result_type' : 'recent', 'count' : 100}
    params = urllib.parse.urlencode(values)
    url = baseUrl + '?' + params
    # request headers
    authorization = 'Bearer %s' % self.bearer
    acceptEncoding = 'gzip'
    headers = {'User-Agent' : self.userAgent, 'Authorization' : authorization, 'Accept-Encoding' : acceptEncoding}
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    rawData = response.read()
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header
    decompressedData = zlib.decompress(rawData, 16 + zlib.MAX_WBITS)
    decompressedData = decompressedData.decode('utf-8')
    #print(decompressedData)
    jsonData = json.loads(decompressedData)
    # this print raises the UnicodeEncodeError
    print(jsonData)

Upvotes: 0

Views: 1165

Answers (1)

phobic

Reputation: 1013

You say that your console does not support UTF-8, so you need to use another encoding. I will try to explain how encode, decode and print work together, and how that leads to your exception.

With decode(encoding) you transform a byte string into a single, unambiguous Unicode representation. You have to specify the encoding because, without it, a byte could be mapped to virtually any character. So you need to know the encoding of the data you get from the website, though it usually is UTF-8.

The first step when you get text from outside your application is to decode it into this Unicode representation, so that you don't need to remember the encoding of each piece of text in your application.
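To make the role of the encoding argument concrete, here is a minimal sketch. The byte string is a made-up example, not data from the question. It decodes the same bytes twice, once as UTF-8 and once as Latin-1, and prints both results with ascii() so that the demonstration itself cannot trip over the console.

```python
# The same UTF-8 bytes, as they might arrive in a web response (made-up example).
raw = b'caf\xc3\xa9 \xe2\x80\xa6'

# Decoding with the correct encoding yields the intended text.
text = raw.decode('utf-8')

# Decoding with a wrong encoding also succeeds, but every byte becomes
# its own character, so the result is garbage.
wrong = raw.decode('latin-1')

# ascii() escapes non-ASCII characters, so these prints work on any console.
print(ascii(text))
print(ascii(wrong))
```

Both calls succeed, which is why a wrong guess at this step tends to go unnoticed until the text is displayed.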

When you print a Unicode string, print assumes that you want the standard encoding of the output stream, though you can specify a different one. The error means that print tried to encode your Unicode text with the standard encoding but failed, because that encoding has no byte representation for a character outside its defined range.

You can check the standard encoding with:

import sys
print(sys.stdout.encoding)
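The failure can be reproduced without a console by doing by hand the encoding step that print performs. This is only a sketch: cp437 is an assumed example of a legacy Windows console code page, and your console may use a different one.

```python
# U+2026 is the horizontal ellipsis named in the error message.
text = 'Loading\u2026'

# cp437 is an assumption here; substitute whatever sys.stdout.encoding reports.
try:
    text.encode('cp437')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# UTF-8 can represent every Unicode character, so this encode succeeds.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```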

When you pass text from your application to another application, or when you want to store it, you need to encode it to a byte representation. So when you send your Unicode string to the console, it has to be transformed to bytes in the encoding that the console expects. For the console, I guess that it expects the bytes from your application to be in the standard encoding.

So, to print the Unicode string, you can call encode on it to transform it to a byte representation that your console can handle. For instance, you can transform it to an ASCII byte representation and replace characters outside of ASCII's defined range with question marks:

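# bytes to unicode (decompressedData is the bytes object returned by zlib.decompress)
decompressedData_unicode = decompressedData.decode('utf-8')
# unicode to bytes; characters outside of ASCII become '?'
decompressedData_string = decompressedData_unicode.encode('ascii', 'replace')
# hope that the console's standard encoding is compatible with ASCII
# in Python 3, decode back to str so that print shows text instead of b'...'
print(decompressedData_string.decode('ascii'))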
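The second argument to encode is the error handler, and 'replace' is only one of the built-in choices. As a small illustration on a made-up string, the following compares three standard handlers:

```python
text = 'Hi\u2026'

# '?' stands in for every character ASCII cannot represent.
replaced = text.encode('ascii', 'replace')
# The character is kept as a readable \uXXXX escape sequence.
escaped = text.encode('ascii', 'backslashreplace')
# The character is dropped silently.
ignored = text.encode('ascii', 'ignore')

# Decode back to str so that print shows text instead of b'...'.
for data in (replaced, escaped, ignored):
    print(data.decode('ascii'))
```

'backslashreplace' may be the better fit when you still want to see which characters were there.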

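If the console allows for another encoding, you can set it as the standard encoding (for example through the PYTHONIOENCODING environment variable) and print the Unicode string directly, or encode with the standard encoding yourself:

import sys
# characters that the standard encoding cannot represent become '?'
decompressedData_string = decompressedData_unicode.encode(sys.stdout.encoding, 'replace')
print(decompressedData_string.decode(sys.stdout.encoding))

and hope that the standard encoding can represent most of the Unicode characters in decompressedData_unicode, since every one it cannot represent is printed as a question mark.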
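For the parsed JSON object itself there is a different route that does not involve encode at all: serialize it back to text with json.dumps, which escapes non-ASCII characters by default. This is an alternative to the approach above, not part of it, and the dictionary below is a hypothetical stand-in for jsonData from the question.

```python
import json

# Hypothetical stand-in for the parsed response (jsonData in the question).
json_data = {'statuses': [{'text': 'Breaking news\u2026'}]}

# ensure_ascii=True (the default) writes each non-ASCII character as \uXXXX,
# so the resulting string contains only ASCII and prints on any console.
printable = json.dumps(json_data, ensure_ascii=True, indent=2)
print(printable)

# Nothing is lost: the escaped text parses back to an equal object.
round_trip = json.loads(printable)
print(round_trip == json_data)
```

The trade-off is readability: the ellipsis shows up as an escape sequence in the output rather than as the character.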

Upvotes: 1
