Reputation: 21
I am making an API call and the response has unicode characters. Loading this response into a file throws the following error:
'ascii' codec can't encode character u'\u2019' in position 22462
I've tried all combinations of decode and encode ('utf-8').
Here is the code:
url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null" :
json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
try:
resp = requests.get(url,
auth=(api_user, api_pwd),
headers={'Content-Type': 'application/json'})
except requests.exceptions.RequestException as e:
print "|********************************************************|"
print e
return "Error: {}".format(e)
print "|********************************************************|"
sys.exit(1)
try:
total_records_extracted = total_records_extracted + rec_cnt
jsonfh = open(json_filename, 'w')
inter = resp.text
string_e = inter#.decode('utf-8')
final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
encoded_data = final.encode('utf-8')
cleaned_data = json.loads(encoded_data)
json.dump(cleaned_data, jsonfh, indent=None)
jsonfh.close()
except ValueError as e:
tb = traceback.format_exc()
print tb
print "|********************************************************|"
print e
print "|********************************************************|"
sys.exit(1)
Lot of developers have faced this issue. a lot of places have asked to use .decode('utf-8')
or having a # _*_ coding:utf-8 _*_
at the top of python.
It is still not helping.
Can someone help me with this issue?
Here is the trace:
Traceback (most recent call last):
File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
cleaned_data = json.loads(encoded_data)
File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)
|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)
Upvotes: 1
Views: 1454
Reputation: 536479
inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')
The text
property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.
You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content
and passing them straight to json.loads
.
What's more, Requests has a shortcut method to do the same: resp.json()
.
final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes, and incorrectly unescape others. Your actual error is nothing to do with Unicode at all, it's that this replacement is mangling the input. For example consider the input JSON:
{"message": "Open the file C:\\newfolder\\text.txt"}
after replacement:
{"message": "Open the file C:\ ewfolder\ ext.txt"}
which is clearly not valid JSON.
Instead of trying to operate on the JSON-encoded string, you should let json
decode the input and then filter any strings you have in the structured output. This may involve using a recursive function to walk down into each level of the data looking for strings to filter. eg
def clean(data):
if isinstance(data, basestring):
return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
if isinstance(data, list):
return [clean(item) for item in data]
if isinstance(data, dict):
return {clean(key): clean(value) for (key, value) in data.items()}
return data
cleaned_data = clean(resp.json())
Upvotes: 1