Reputation: 19805
I have a problematic JSON string that contains some funky unicode characters:
{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}
and if I try to parse it with Python:
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s)
# Error..
If I can accept skipping or losing the value of these unicode characters, what is the best way to make json.loads(s) work?
Upvotes: 0
Views: 2127
Reputation: 153
I'm a bit late to the party, but we were seeing a similar issue; to be precise, the one described in "Logstash JSON input with escaped double quote", just for \xXX.
There, JS.stringify created such (per specification) invalid JSON texts.
The solution is to simply replace \x with \u00, since Unicode escapes (\uXXXX) are allowed in JSON, while \xXX escapes are not.
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)
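One caveat with the plain replace: it also rewrites an x that follows an escaped backslash (\\x in the JSON text), which was already valid JSON. A minimal sketch that skips those cases, assuming Python 3; the name fix_hex_escapes is made up for illustration and is not part of any library:

```python
import json
import re

# Match either an escaped backslash (kept untouched) or a \xhh escape.
_TOKEN = re.compile(r'\\\\|\\x([0-9a-fA-F]{2})')

def fix_hex_escapes(text):
    # Hypothetical helper: turn \xhh into \u00hh, leave \\ pairs alone.
    def _sub(match):
        if match.group(1) is None:      # matched an escaped backslash
            return match.group(0)
        return '\\u00' + match.group(1)
    return _TOKEN.sub(_sub, text)

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
data = json.loads(fix_hex_escapes(s))
print(data['test']['foo'])
```

For input that never contains an escaped backslash directly before an x, this should behave the same as the one-line replace.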
Upvotes: 0
Reputation: 1121266
You don't have JSON; that string can be interpreted directly as a Python literal instead. Use ast.literal_eval():
>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:
>>> print(_['test']['foo'])
Ig0s\/k\/4jRk
This parses the input as Python source, but only allows literal values: strings, None, True, False, numbers and containers (lists, tuples, dictionaries).
This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.
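One limitation worth checking before relying on this: literal_eval() follows Python syntax, so input that uses the JSON-only literals true, false or null is rejected. A small sketch of both cases, assuming Python 3; the "flag" document is an invented example:

```python
import ast

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
value = ast.literal_eval(s)['test']['foo']
print(value)

# JSON-only literals are not Python literals, so this input is rejected.
try:
    ast.literal_eval('{"flag": true}')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

So this route only fits when the payload happens to be valid Python literal syntax as well.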
Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:
import re
escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')
def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)
Demo:
>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}
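Since the question allows losing these characters, the same kind of regular expression could also drop the escapes instead of translating them. A minimal sketch, assuming Python 3; strip_hex_escapes is a hypothetical helper name:

```python
import json
import re

hex_escape = re.compile(r'\\x[a-fA-F0-9]{2}')

def strip_hex_escapes(string):
    # Delete every \xhh sequence; the character it stood for is lost.
    return hex_escape.sub('', string)

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
result = json.loads(strip_hex_escapes(s))
print(result)  # {'test': {'foo': 'Ig0s/k/4jRk'}}
```

Note that this changes the data: the backslashes encoded as \x5C are gone from the value.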
If you can repair the source producing this value so that it outputs actual JSON instead, that would be a much better solution.
Upvotes: 1
Reputation: 414079
If the rest of the string, apart from the invalid \x5c, is valid JSON, then you could use the string-escape encoding to decode \x5c into backslashes:
>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape'))
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
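The string-escape codec exists only in Python 2. A rough Python 3 counterpart, offered as a sketch rather than a drop-in replacement, goes through bytes and the unicode_escape codec; it assumes the input is pure ASCII:

```python
import json

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'

# Encode to bytes first, then let unicode_escape turn \x5C into a backslash.
decoded = s.encode('ascii').decode('unicode_escape')
result = json.loads(decoded)
print(result)  # {'test': {'foo': 'Ig0s/k/4jRk'}}
```

As with string-escape, every backslash escape in the text is decoded, not just \xhh, so input that also contains escapes such as \" or \n may no longer be valid JSON afterwards.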
Upvotes: 1