patrick

Reputation: 4852

Dealing with mis-escaped characters in JSON

I am reading a JSON file into Python which contains escaped single quotes (\'). This leads to all kinds of hiccups, as nicely discussed e.g. here. However, I could not find anything on how to address the issue. I just did a

newstring=originalstring.replace(r"\'", "'")

and things worked out. But this seems rather ugly. I could not really find much material on how to deal with this kind of thing (creating an exception, or something) in the json docs either.

Going back to the source is not possible, unfortunately.

Thanks for your help!

Upvotes: 2

Views: 7924

Answers (3)

Martijn Pieters

Reputation: 1123450

The JSON standard defines a specific set of valid 2-character escape sequences: \\, \/, \", \b, \r, \n, \f and \t, plus one longer escape sequence for Unicode code points, \uhhhh (\u followed by 4 hex digits). Any other combination of a backslash and another character is invalid JSON.

If you have a JSON source you can't fix otherwise, the only way out is to remove the invalid sequences, like you did with str.replace(), even if that is a little fragile (it breaks when an even number of backslashes precedes the quote, because \\' is a valid escaped backslash followed by a plain quote).

You could use a regular expression too, where you remove any backslashes not used in a valid sequence:

fixed = re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', inputstring)

This won't catch an invalid escape at the end of an odd-count backslash sequence such as \\\' but will catch anything else:

>>> import re, json
>>> broken = r'"JSON string with escaped quote: \' and various other broken escapes: \a \& \$ and a newline!\n"'
>>> json.loads(broken)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 34 (char 33)
>>> json.loads(re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', broken))
"JSON string with escaped quote: ' and various other broken escapes: a & $ and a newline!\n"

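The odd-count case could be covered by consuming escapes pairwise instead of relying on a lookbehind. The following is a minimal sketch, not part of the original answer: strip_invalid_escapes is a hypothetical helper name, and it assumes that backslashes only occur inside JSON strings in the input.

```python
import json
import re

# A backslash plus what follows it: either the four hex digits of a \uXXXX
# escape (with the leading "u"), or any single character.
_ESCAPE = re.compile(r'\\(u[0-9a-fA-F]{4}|.)', re.DOTALL)
_VALID = set('"\\/bfnrt')


def strip_invalid_escapes(text):
    """Drop the backslash from every escape that JSON does not define."""
    def fix(match):
        body = match.group(1)
        if len(body) == 5 or body in _VALID:
            return match.group(0)  # valid escape: keep it untouched
        return body  # invalid escape: keep the character, drop the backslash
    return _ESCAPE.sub(fix, text)


broken = r'"escaped quote: \' after a real backslash: \\\' and \a \& \$"'
print(json.loads(strip_invalid_escapes(broken)))
```

Because the scan runs left to right, \\ is always consumed as one unit, so in \\\' only the third backslash is treated as the start of an invalid escape.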
Upvotes: 5

Alex Hall

Reputation: 36043

Your solution is not bad. It seems ugly because the problem is ugly: you have corrupt data. It's certainly simple, elegant, and effective. It will only fail if the substring \\' (that's three characters, I'm not escaping anything) is present anywhere, and even then only if the number of consecutive backslashes is even. So your options are:

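  1. Just do your current thing, but first check if r"\\'" in originalstring and raise an error if so. Easy, safe, probably fine.
  2. Use a regex that takes the run of backslashes before the quote into account, for example with a negative lookbehind. Note that Python's re module only accepts fixed-width lookbehinds, so a pattern like (\\\\)+ cannot go inside one directly.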
  3. Catch errors and use the attributes of the errors to decide on a portion of the string to replace.

Check out this snippet:

import json
from json.decoder import JSONDecodeError

s = r'"\'"'
print(s)
try:
    print(json.loads(s))
except JSONDecodeError as e:
    print(vars(e))

Output:

"\'"
{'msg': 'Invalid \\escape', 'colno': 2, 'doc': '"\\\'"', 'pos': 1, 'lineno': 1}
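
Building on those attributes, option 3 could look roughly like the sketch below. It rests on some assumptions: loads_dropping_bad_escapes is a made-up name, the check relies on the message starting with Invalid \escape as in the output above, and the string is re-parsed after every fix, so it will be slow on large inputs with many bad escapes.

```python
import json
from json.decoder import JSONDecodeError


def loads_dropping_bad_escapes(text):
    """Parse JSON, deleting the backslash of each invalid escape reported."""
    while True:
        try:
            return json.loads(text)
        except JSONDecodeError as e:
            if not e.msg.startswith("Invalid \\escape"):
                raise  # some other syntax problem: do not guess
            # Depending on the scanner implementation, pos may point at the
            # backslash or at the character right after it; handle both.
            pos = e.pos if text[e.pos] == "\\" else e.pos - 1
            if pos < 0 or text[pos] != "\\":
                raise
            # Remove just that one backslash and try again.
            text = text[:pos] + text[pos + 1:]


print(loads_dropping_bad_escapes(r'"\'"'))
```

Since the decoder itself decides what counts as invalid, valid sequences such as \\ are never touched.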

Upvotes: 1

Barmar

Reputation: 781751

The right thing would be to fix whatever is creating the invalid JSON file. If that's not possible, I guess the replace is needed, but you should use a regular expression so it doesn't turn \\' into \'. In that case the first backslash is escaping the second backslash, so neither of them is escaping the quote. A negative lookbehind will prevent this.

import re
newstring = re.sub(r"(?<!\\)\\'", "'", originalstring)
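
If longer runs of backslashes can precede the quote (for example \\\'), one possible variant is a callback that looks at the parity of the whole run. This is only a sketch, and unescape_single_quotes is a hypothetical name:

```python
import re


def unescape_single_quotes(text):
    """Remove the backslash from \\' only when it really escapes the quote."""
    def fix(match):
        slashes = match.group(1)
        if len(slashes) % 2:  # odd run: the last backslash escapes the quote
            slashes = slashes[:-1]
        return slashes + "'"
    # Match the whole run of backslashes directly before a single quote.
    return re.sub(r"(\\+)'", fix, text)


print(unescape_single_quotes(r"it\'s"))
```

An even run such as \\' is left alone, because every backslash in it is already paired with another backslash.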

Upvotes: 2
