Howard

Reputation: 19805

Parsing invalid Unicode JSON in Python

I have a problematic JSON string that contains some funky Unicode characters:

{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}

If I try to convert it using Python:

import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s)
# Raises an error: \x is not a valid JSON escape

If I can accept skipping or losing the value of these Unicode characters, what is the best way to make json.loads(s) work?

Upvotes: 0

Views: 2127

Answers (3)

darthn

Reputation: 153

I'm a bit late to the party, but we saw a similar issue; to be precise, the one described in "Logstash JSON input with escaped double quote", only with \xXX escapes.

There, JS.stringify produced JSON texts that are invalid per the specification.

The solution is simply to replace \x with \u00: JSON allows \uXXXX Unicode escapes but not \xXX hex escapes.

import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)

Upvotes: 0

Martijn Pieters

Reputation: 1121266

You don't have JSON; what you have can be interpreted directly as a Python literal instead. Use ast.literal_eval():

>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}

The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:

>>> print(_['test']['foo'])
Ig0s\/k\/4jRk

This parses the input as Python source, but only allows literal values: strings, None, True, False, numbers and containers (lists, tuples, dictionaries).

This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.
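To make the "literal values only" restriction concrete, here is a small Python 3 sketch. The second input string is my own illustration, not from the question: it is valid JSON, but true and null are not Python literals, so ast.literal_eval() rejects it while json.loads() accepts it.

```python
import ast
import json

# Python-style literal with \x escapes: accepted by ast.literal_eval().
python_style = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
parsed = ast.literal_eval(python_style)
print(parsed["test"]["foo"])  # Ig0s\/k\/4jRk

# Genuine JSON using true/null (illustrative input, not from the question):
# these names are not Python literals, so literal_eval raises ValueError.
json_only = '{"ok": true, "missing": null}'
try:
    ast.literal_eval(json_only)
    literal_eval_accepted = True
except ValueError:
    literal_eval_accepted = False

print(literal_eval_accepted)  # False
print(json.loads(json_only))  # {'ok': True, 'missing': None}
```

So this route only fits input that really is a Python literal; if the producer also emits true, false or null, repairing the escapes and staying with json.loads() is likely the safer choice.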

Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:

import re

escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')

def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)

Demo:

>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}

If you can repair the source producing this value so that it outputs actual JSON instead, that would be a much better solution.

Upvotes: 1

jfs

Reputation: 414079

If the rest of the string, apart from the invalid \x5c, is valid JSON, then in Python 2 you could use the string-escape codec to decode \x5c into backslashes:

>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape')) 
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
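The string-escape codec and str.decode() only exist in Python 2. A rough Python 3 equivalent, assuming the input is pure ASCII (unicode_escape would mangle other characters), might look like this:

```python
import codecs
import json

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'

# Assumes s is pure ASCII; unicode_escape would mangle other characters.
decoded = codecs.decode(s.encode('ascii'), 'unicode_escape')
print(decoded)              # {"test":{"foo":"Ig0s\/k\/4jRk"}}
print(json.loads(decoded))  # {'test': {'foo': 'Ig0s/k/4jRk'}}
```

Note that the decoded backslash then combines with the following slash into the JSON escape \/, so the parsed value contains no backslashes at all. Rewriting \x5C to \u005C instead keeps the backslash as data.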

Upvotes: 1
