Kannan Ekanath

Reputation: 17601

Python json ignoring non-ASCII chars: UnicodeDecodeError: 'ascii' codec can't decode byte

Python 2.7.3

I have read all the related threads about json/dumps and UnicodeDecodeError, and most of them want me to work out what encoding I need. In my case I am building a JSON document whose key values come from various services (some p4 command lines), possibly in different encodings. I have a map something like this:

map = {"system1": some_data_from_system1, "system2", some_data_from_system2}
json.dumps(map)

This throws an "UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 737: ordinal not in range(128)"

I would like to have ASCII characters dumped into a file; occasionally a p4 checkin or Jira issue might contain non-ASCII chars, and it is perfectly okay to ignore those. I have tried ensure_ascii=False and it does not solve the problem. What I really want is for the encoder to simply ignore any non-ASCII chars along the way. I think this is reasonable but I cannot find any way to do it.
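A stripped-down version of what fails (the real values come from p4/Jira; 0x92 here is just a stand-in byte):

import json
# one subsystem returned unicode, another returned raw bytes
json.dumps({"system1": u"ok", "system2": "text \x92 here"})  # raises UnicodeDecodeError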

Suggestions?

Upvotes: 0

Views: 5627

Answers (3)

tcruzfranca

Reputation: 1

If your string in JSON format has non-ASCII characters and you need to pass it through Python's dumps method accordingly:

myString = "{key: 'Brazilian Portuguese has many differenct characters like  maçã (apple) or Bíblia (Blible)' }" # or a map
myJSON = json.dumps(myString, encoding="latin-1") #use utf8 if appropriate
myJSON = json.loads(myJSON)
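A quick sketch of what the encoding argument does here (assuming the bytes really are Latin-1): json.dumps() decodes byte strings with the given codec before escaping them:

>>> import json
>>> json.dumps({"fruit": "ma\xe7\xe3"}, encoding="latin-1")  # Latin-1 bytes for "maçã"
'{"fruit": "ma\\u00e7\\u00e3"}'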

Upvotes: 0

Kannan Ekanath

Reputation: 17601

I have used a combination of How to get string objects instead of Unicode ones from JSON in Python? and the answer above to do this piece of logging.

As stated above, some_data_from_system{1|2} are not strings. The question is about a general error-logging system: when things go wrong, you want to dump as much information from the various subsystems as possible for human inspection. The subsystems change between environments, and it is not always known what encoding they use when they return "jsons" describing what went wrong. To this end I have the following method, stolen from the other thread; the essence is the encode call with 'ignore'.

PLEASE NOTE: This is not a very performant method (blind recursions rarely are), so it is not suitable for a typical production application. Depending upon the data it is possible to run into infinite recursion. However, assuming you understand those disclaimers, it is fine for error-logging systems.

def convert_encoding(data, encoding='ascii'):
    # Recursively walk dicts and lists, encoding unicode values to byte
    # strings and silently dropping characters the codec cannot represent.
    if isinstance(data, dict):
        return dict((convert_encoding(key, encoding), convert_encoding(value, encoding))
                    for key, value in data.iteritems())
    elif isinstance(data, list):
        return [convert_encoding(element, encoding) for element in data]
    elif isinstance(data, unicode):
        return data.encode(encoding, 'ignore')
    else:
        return data

map = {"system1": some_data_from_system1, "system2", some_data_from_system2}
json.dumps(convert_encoding(map), ensure_ascii = False)

Once that is done, this generic method can be used to dump the data.
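For example (hypothetical data; U+2019 is the curly quote that byte 0x92 usually maps to in Windows-1252):

>>> data = {u"system1": u"checkin \u2019quote\u2019", "system2": [u"ok", 42]}
>>> converted = convert_encoding(data)
>>> converted["system1"]
'checkin quote'
>>> type(converted["system1"])
<type 'str'>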

Upvotes: 0

Martijn Pieters

Reputation: 1121924

The json.dumps() and json.dump() functions will try to decode byte strings to Unicode values when passed in, using UTF-8 by default:

>>> map = {"system1": '\x92'}
>>> json.dumps(map)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> map = {"system1": u'\x92'.encode('utf8')}
>>> json.dumps(map)
'{"system1": "\\u0092"}'

You can set the encoding keyword argument to use a different encoding for byte string (str) values.
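For example, if the bytes happen to come from a Windows tool (an assumption; pick the codec that matches your data), 0x92 is a curly apostrophe in the cp1252 codepage:

>>> import json
>>> json.dumps({'system1': '\x92'}, encoding='cp1252')
'{"system1": "\\u2019"}'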

These functions do this because JSON is a standard that uses Unicode for all strings. If you feed it data that is not encoded as UTF-8, this fails, as shown above.

On Python 2 the output is a byte string too, encoded to UTF-8. It can be safely written to a file. Setting the ensure_ascii argument to False would change that and you'd get Unicode instead, which you clearly don't want.
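To illustrate the difference (a quick sketch):

>>> import json
>>> json.dumps({'k': u'caf\xe9'})                      # byte string, ASCII-safe
'{"k": "caf\\u00e9"}'
>>> json.dumps({'k': u'caf\xe9'}, ensure_ascii=False)  # unicode object
u'{"k": "caf\xe9"}'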

So you need to ensure that what you put into the json.dumps() function is consistently all in the same encoding, or is already decoded to unicode objects. If you don't care about the occasional missed codepoint, you'd do so by forcing a decode with the error handler set to replace or ignore:

map = {"system1": some_data_from_system1.decode('ascii', errors='ignore')}

This decodes the string forcibly, replacing any bytes that are not recognized as ASCII codepoints with a replacement character:

>>> '\x92'.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)
>>> '\x92'.decode('ascii', 'replace')
u'\ufffd'

Here a U+FFFD REPLACEMENT CHARACTER codepoint is inserted instead to represent the unknown byte. You could also drop such bytes entirely by using the 'ignore' error handler. (Note that on Python 2 the errors argument must be passed positionally; str.decode() does not accept keyword arguments.)
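If several values need the same treatment, a minimal sketch (assuming a flat dict of byte strings; nested structures would need recursion, as in the other answer; ascii_only is a hypothetical helper name):

def ascii_only(raw_map):
    # Force-decode every byte-string value, silently dropping non-ASCII bytes.
    return dict(
        (key, value.decode('ascii', 'ignore') if isinstance(value, str) else value)
        for key, value in raw_map.iteritems()
    )

>>> ascii_only({'system1': 'bad \x92 byte'})
{'system1': u'bad  byte'}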

Upvotes: 2
