Reputation: 38929
I'm currently converting some customer-entered strings to JSON. I've already made a dict out of the strings, and now am just doing:
json.dumps(some_dict)
Problem is, for some of the customer-entered data, it seems they've somehow entered garbled bytes, and trying to dump to JSON breaks the whole thing:
{'FIRST_NAME': 'sdffg\xed', 'LAST_NAME': 'sdfsadf'}
Which then gets me:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
I have no control over where the data comes from, so I can't prevent this upfront. So, now that this bad data already exists, I was thinking of just replacing unknown/bad characters with some placeholder character, or deleting them. How can I do this?
Upvotes: 2
Views: 2880
Reputation: 536359
{'FIRST_NAME': 'sdffg\xed', 'LAST_NAME': 'sdfsadf'}
is a Python dictionary whose keys and values are byte strings. That cannot be represented in JSON because JSON doesn't have any concept of bytes. JSON string values are always Unicode, so to faithfully reproduce a Python dict, you have to make sure all the textual keys and values are unicode (u'...') strings.
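For instance, once the value is unicode the dump succeeds (Python 2 shown, to match the question; non-ASCII is escaped in the output by default):

>>> import json
>>> json.dumps({'FIRST_NAME': u'sdffg\xed'})
'{"FIRST_NAME": "sdffg\\u00ed"}'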
Python will let you get away with 'FIRST_NAME' because it is limited to plain ASCII; most popular byte encodings are ASCII supersets, so Python can reasonably safely implicitly decode the string as ASCII. But that is not the case for strings with bytes outside the range 0x00-0x7F, such as 'sdffg\xed'. You should .decode the byte str to a unicode string before putting it in the dictionary.

(Really you should try to ensure that your textual data is kept in Unicode strings for all of your application processing, converting to byte strings only when input is loaded from a non-Unicode source and when output has to go to a non-Unicode destination. So you shouldn't have ended up with byte content in a dictionary at this point. Check where that input is coming from - you probably should be doing the decode() step further up.)
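A minimal sketch of what that boundary decode might look like (read_customer_field is a hypothetical helper, and the utf-8 codec is an assumption - substitute whatever the source actually produces; errors='replace' keeps a wrong guess from crashing):

# Decode at the input boundary so only unicode circulates internally.
def read_customer_field(raw_bytes):
    # 'utf-8' is an assumed source encoding; undecodable bytes become U+FFFD
    return raw_bytes.decode('utf-8', 'replace')

some_dict = {'FIRST_NAME': read_customer_field('sdffg\xed'),  # u'sdffg\ufffd'
             'LAST_NAME': read_customer_field('sdfsadf')}     # u'sdfsadf'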
You can decode to Unicode and skip or replace non-ASCII characters by using:
>>> 'sdffg\xed'.decode('ascii', 'ignore')
u'sdffg'
>>> 'sdffg\xed'.decode('ascii', 'replace')
u'sdffg\ufffd'   # U+FFFD = �. Unicode string, which json.dumps can serialise OK
but it seems a shame to throw away potentially useful data. If you can guess the encoding that was used to create the byte string, you can keep the subset of non-ASCII characters that are recoverable. If the byte 0xED represents the character U+00ED i-acute (í), then .decode('iso-8859-1') or possibly .decode('cp1252') may be what you are looking for.
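For example, if iso-8859-1 turns out to be the right guess, the byte survives as í and serialises cleanly:

>>> 'sdffg\xed'.decode('iso-8859-1')
u'sdffg\xed'
>>> json.dumps({'FIRST_NAME': 'sdffg\xed'.decode('iso-8859-1')})
'{"FIRST_NAME": "sdffg\\u00ed"}'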
Upvotes: 6
Reputation: 2334
json.dumps attempts to decode byte strings to ASCII (unless you supply an encoding). So you need to make sure that your strings will decode as ASCII. Luckily for us, unicode() decodes its input as ASCII if an encoding isn't specified, and errors='ignore' drops the bytes that can't be. So ...
copy = {}
for k, v in d.items():
    copy[k] = unicode(v, errors='ignore')
json.dumps(copy)
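Applied to the data from the question, it produces valid JSON, though note the stray 0xed byte is silently dropped:

>>> d = {'FIRST_NAME': 'sdffg\xed', 'LAST_NAME': 'sdfsadf'}
>>> copy = dict((k, unicode(v, errors='ignore')) for k, v in d.items())
>>> json.dumps(copy, sort_keys=True)  # sort_keys just makes the output stable
'{"FIRST_NAME": "sdffg", "LAST_NAME": "sdfsadf"}'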
Upvotes: 2