user6471969

Reputation: 21

Unicode API response throwing error "'ascii' codec can't encode character u'\u2019' in position 22462"

I am making an API call, and the response contains Unicode characters. Writing this response to a file throws the following error:

'ascii' codec can't encode character u'\u2019' in position 22462

I've tried all combinations of decode and encode ('utf-8').

Here is the code:

url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null":
    json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
    try:
        resp = requests.get(url,
                            auth=(api_user, api_pwd),
                            headers={'Content-Type': 'application/json'})

    except requests.exceptions.RequestException as e:
        print "|********************************************************|"
        print e
        return "Error: {}".format(e)
        print "|********************************************************|"
        sys.exit(1)

    try:
        total_records_extracted = total_records_extracted + rec_cnt
        jsonfh = open(json_filename, 'w')
        inter = resp.text
        string_e = inter#.decode('utf-8')
        final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
        encoded_data = final.encode('utf-8')
        cleaned_data = json.loads(encoded_data)
        json.dump(cleaned_data, jsonfh, indent=None)
        jsonfh.close()
    except ValueError as e:
        tb = traceback.format_exc()
        print tb
        print "|********************************************************|"
        print  e
        print "|********************************************************|"
        sys.exit(1)

Many developers have faced this issue. Many answers suggest using .decode('utf-8') or putting # -*- coding: utf-8 -*- at the top of the Python file.

Neither has helped.

Can someone help me with this issue?

Here is the trace:

Traceback (most recent call last):
  File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
    cleaned_data = json.loads(encoded_data)
  File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)

|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)

Upvotes: 1

Views: 1454

Answers (1)

bobince

Reputation: 536479

inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')

The text property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.

You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content and passing them straight to json.loads.

What's more, Requests has a shortcut method to do the same: resp.json().
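As a rough illustration of handing the bytes straight to the JSON decoder, here is a minimal sketch that needs no network access. The variable raw_bytes is a hypothetical stand-in for resp.content, and the payload is made up; the sketch assumes Python 2.7 or Python 3.6 and later, where json.loads accepts UTF-8 bytes directly.

```python
import json

# Hypothetical stand-in for resp.content: UTF-8 encoded bytes, as they
# would arrive from the server. The payload itself is invented.
raw_bytes = u'{"message": "It\u2019s working"}'.encode('utf-8')

# Let the JSON decoder deal with the encoding; no manual decode/encode step.
data = json.loads(raw_bytes)

# With the default ensure_ascii=True, json.dumps escapes non-ASCII
# characters, so the serialized text contains only ASCII.
text = json.dumps(data)
print(text)
```

Because the serialized text is pure ASCII under the default settings, writing it to a file opened with a plain open(json_filename, 'w') should not hit the 'ascii' codec error from the question.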

final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')

Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes and incorrectly unescape others. Your actual error has nothing to do with Unicode at all; it is that this replacement is mangling the input. For example, consider the input JSON:

{"message": "Open the file C:\\newfolder\\text.txt"}

after replacement:

{"message": "Open the file C:\ ewfolder\ ext.txt"}

which is clearly not valid JSON.
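The failure can be reproduced with a short sketch built from the example input above. The variable names (raw, mangled, error) are only illustrative.

```python
import json

# The example input: inside the JSON text, each backslash is escaped as \\.
raw = r'{"message": "Open the file C:\\newfolder\\text.txt"}'

# The unmodified text parses fine and yields single backslashes.
original_message = json.loads(raw)["message"]

# The replacement from the question matches the second backslash plus the
# following letter, leaving a lone backslash followed by a space.
mangled = raw.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')

try:
    json.loads(mangled)
    error = None
except ValueError as e:  # json.JSONDecodeError subclasses ValueError on Python 3
    error = e

print(mangled)
print(error)
```

The exception raised here is the same kind of invalid-escape ValueError as in the traceback from the question.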

Instead of trying to operate on the JSON-encoded string, you should let json decode the input and then filter any strings you have in the structured output. This may involve using a recursive function to walk down into each level of the data looking for strings to filter. For example:

def clean(data):
    if isinstance(data, basestring):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

cleaned_data = clean(resp.json())
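To see what the recursive filter does to nested data, here is the same function applied to a small made-up payload. The sample dictionary and the string_types shim are assumptions added for illustration; the shim only exists so the sketch runs on Python 3 as well, where basestring is not defined.

```python
try:
    string_types = basestring  # Python 2: covers both str and unicode
except NameError:
    string_types = str  # Python 3: basestring no longer exists

def clean(data):
    # Replace newlines, tabs and carriage returns in every string,
    # recursing into lists and dicts (keys included).
    if isinstance(data, string_types):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

# Invented sample standing in for the decoded API response.
sample = {"ticket": {"subject": "line1\nline2", "tags": ["a\tb", 3]}, "count": 2}
result = clean(sample)
print(result)
```

Non-string values such as numbers pass through unchanged, and the original structure is not modified because new lists and dicts are built at each level.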

Upvotes: 1
