Reputation: 17601
Python 2.7.3
I have read all the related threads about json.dumps UnicodeDecodeError, and most of them want me to understand what encoding I need. In my case I am creating a JSON document whose key values come from various services (some p4 command lines), possibly with different encodings. I have a map something like this:
map = {"system1": some_data_from_system1, "system2": some_data_from_system2}
json.dumps(map)
This throws an "UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 737: ordinal not in range(128)"
I would like to have only ASCII characters dumped into a file; occasionally a p4 checkin or Jira entry might have non-ASCII chars, and it is perfectly okay to ignore those. I have tried ensure_ascii=False and it does not solve the problem. What I really want is for the encoder to simply ignore any non-ASCII chars along the way. I think this is reasonable but cannot find any way out.
Suggestions?
Upvotes: 0
Views: 5627
Reputation: 1
If your JSON-formatted string has non-ASCII characters and you need to run it through Python's dumps method:
import json

myString = "{key: 'Brazilian Portuguese has many different characters like maçã (apple) or Bíblia (Bible)' }"  # or a map
myJSON = json.dumps(myString, encoding="latin-1")  # use utf8 if appropriate for your data
myJSON = json.loads(myJSON)
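A quick sketch of what the round trip yields, assuming the byte string actually holds Latin-1 data: dumps decodes it to Unicode and escapes it, and loads hands back a unicode object:

>>> s = 'ma\xe7\xe3'  # "maçã" as Latin-1 bytes
>>> json.dumps(s, encoding='latin-1')
'"ma\\u00e7\\u00e3"'
>>> json.loads(json.dumps(s, encoding='latin-1'))
u'ma\xe7\xe3'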
Upvotes: 0
Reputation: 17601
I have used a combination of How to get string objects instead of Unicode ones from JSON in Python? and the answer above to do this piece of logging.
As stated above, the some_data_from_system{1|2} values are not strings. The question is about a general error-logging system. When things go wrong you want to dump as much information from several subsystems as possible for human inspection. The subsystems change between environments, and it is not always known what encoding they use when they return "jsons" representing what went/was wrong. To this effect I have the following method, stolen from the other thread, but the essence is basically the encode method with 'ignore'.
PLEASE NOTE: This is not a very performant method (most blind recursions are not). So it is not suitable for a typical production application; depending upon the data, it is possible to run into an infinite loop. However, assuming you understand the disclaimers, it is okay for error-logging systems.
def convert_encoding(data, encoding='ascii'):
    # Recursively walk dicts and lists, forcing every unicode value
    # down to the target encoding and silently dropping whatever
    # does not fit.
    if isinstance(data, dict):
        return dict((convert_encoding(key, encoding), convert_encoding(value, encoding))
                    for key, value in data.iteritems())
    elif isinstance(data, list):
        return [convert_encoding(element, encoding) for element in data]
    elif isinstance(data, unicode):
        return data.encode(encoding, 'ignore')
    else:
        return data
map = {"system1": some_data_from_system1, "system2": some_data_from_system2}
json.dumps(convert_encoding(map), ensure_ascii=False)
Once that is done, this generic method can be used to dump the data.
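For illustration, a small usage sketch with a hypothetical payload; the é is dropped by the 'ignore' handler and everything comes out as plain ASCII byte strings:

>>> data = {u'system1': [u'check-in caf\xe9', 42]}
>>> convert_encoding(data)
{'system1': ['check-in caf', 42]}
>>> json.dumps(convert_encoding(data), ensure_ascii=False)
'{"system1": ["check-in caf", 42]}'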
Upvotes: 0
Reputation: 1121924
The json.dumps() and json.dump() functions will try to decode byte strings to Unicode values when passed in, using UTF-8 by default:
>>> map = {"system1": '\x92'}
>>> json.dumps(map)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> map = {"system1": u'\x92'.encode('utf8')}
>>> json.dumps(map)
'{"system1": "\\u0092"}'
You can set the encoding keyword argument to use a different encoding for byte string (str) characters.
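For example, every byte is a valid Latin-1 character (byte 0x92 maps straight to codepoint U+0092), so the same map encodes cleanly if you tell dumps to treat byte strings as Latin-1; a quick sketch:

>>> json.dumps({"system1": '\x92'}, encoding='latin-1')
'{"system1": "\\u0092"}'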
These functions do this because JSON is a standard that uses Unicode for all strings. If you feed it data that is not encoded as UTF-8, this fails, as shown above.
On Python 2 the output is a byte string too, encoded to UTF-8. It can be safely written to a file. Setting the ensure_ascii argument to False would change that and you'd get Unicode instead, which you clearly don't want.
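A short sketch of the difference, using a sample unicode value:

>>> json.dumps({"system1": u"caf\xe9"})
'{"system1": "caf\\u00e9"}'
>>> json.dumps({"system1": u"caf\xe9"}, ensure_ascii=False)
u'{"system1": "caf\xe9"}'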
So you need to ensure that what you put into the json.dumps() function is consistently all in the same encoding, or is already decoded to unicode objects. If you don't care about the occasional missed codepoint, you can force a decode with the error handler set to replace or ignore:
map = {"system1": some_data_from_system1.decode('ascii', errors='ignore')}
This decodes the string forcibly. With errors='replace', any bytes that are not recognized as ASCII codepoints are replaced with a replacement character:
>>> '\x92'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)
>>> '\x92'.decode('ascii', errors='replace')
u'\ufffd'
Here a U+FFFD REPLACEMENT CHARACTER codepoint is inserted instead to represent the unknown codepoint. You could also completely ignore such bytes by using errors='ignore'.
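With the ignore handler the offending byte is simply dropped instead; a quick sketch:

>>> 'bad \x92 byte'.decode('ascii', errors='ignore')
u'bad  byte'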
Upvotes: 2