Reputation: 711
I am receiving malformed JSON: the value of the "text" key holds user comments, which can contain unescaped double quotes. I need to repair the JSON so that it parses.
{"test":[{"id":"1234","user":{"id":"1234"},"text":"test, "." test " 1234"","created":"2019-01-09"}]}
I tried the code below from another thread, but I am not able to make it work.
import json, re
while True:
    try:
        result = json.loads(test.json) # try to parse...
        break                          # parsing worked -> exit loop
    except Exception as e:
        # "Expecting , delimiter: line 34 column 54 (char 1158)"
        # position of unexpected character after '"'
        unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
        # position of unescaped '"' before that
        unesc = s.rfind(r'"', 0, unexp)
        s = s[:unesc] + r'\"' + s[unesc+1:]
        # position of corresponding closing '"' (+2 for inserted '\')
        closg = s.find(r'"', unesc + 2)
        s = s[:closg] + r'\"' + s[closg+1:]
print result
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
IndexError: list index out of range
Expected result (note the escaped double quotes in the value of the "text" key):
{"test":[{"id":"1234","user":{"id":"1234"},"text":"test \".\" test \" 1234\"","created":"2019-01-09"}]}
Alternatively, removing all double quotes after "text": and before "created", and then enclosing the value in a single opening and closing ", would also solve my issue:
{"test":[{"id":"1234","user":{"id":"1234"},"text":"test . test 1234","created":"2019-01-09"}]}
Upvotes: 0
Views: 659
Reputation: 13106
Only that one line needs editing, so you can use a regex to match it, fix the value, and then join it back with the rest of the JSON string so that it can be parsed:
import re
import json

json_str = '''{
  "test": [
    {
      "id": "1234",
      "user": {
        "id": "1234"
      },
      "text": "test "." test " 1234"",
      "created": "2019-01-09"
    }
  ]
}'''

lines = []
# match the line that holds the "text" key
text_line = re.compile(r'^\s+"text"')

for line in json_str.split('\n'):
    # if a match happens, this will execute and fix the "text" line
    if text_line.match(line):
        # split on the first colon only, in case the comment contains colons
        k, v = line.split(':', 1)
        # drop the trailing comma, then slice off the wrapping double
        # quotes (the first and last chars) so that they are not escaped
        v = '"%s",' % v.strip().rstrip(',')[1:-1].replace('"', '\\"')
        line = '%s: %s' % (k, v)
    # otherwise, carry on
    lines.append(line)

print('\n'.join(lines))
{
  "test": [
    {
      "id": "1234",
      "user": {
        "id": "1234"
      },
      "text": "test \".\" test \" 1234\"",
      "created": "2019-01-09"
    }
  ]
}
# Now you can parse it with json.loads
json.loads('\n'.join(lines))
{'test': [{'id': '1234', 'user': {'id': '1234'}, 'text': 'test "." test " 1234"', 'created': '2019-01-09'}]}
There is some optimization that can be done, but if the JSON arrives on a single line, you can find all of the keys using re and then fix the value in a similar fashion as before:
import re
import json

# Now all one line
s = '''{"test":[{"id":"1234","user":{"id":"1234"},"text":"test, "." test " 1234"","created":"2019-01-09"}]}'''

# find our keys, which will serve as our placeholders
keys = re.findall(r'"\w+":', s)
# ['"test":', '"id":', '"user":', '"id":', '"text":', '"created":']

# now we can find the indices for those keys to mark start
# and finish locations to extract the value
start, finish = s.index(keys[-2]), s.index(keys[-1])
# split on the first colon only, in case the comment contains colons
k, v = s[start:finish].split(':', 1)

# replace v as before, dropping the trailing comma before slicing
v = '"%s",' % v.strip().rstrip(',')[1:-1].replace('"', '\\"')
# '"test, \\".\\" test \\" 1234\\"",'

# rebuild the string, since strings are immutable
s = s[:start] + '%s: %s' % (k, v) + s[finish:]

json.loads(s)
# {'test': [{'id': '1234', 'user': {'id': '1234'}, 'text': 'test, "." test " 1234"', 'created': '2019-01-09'}]}
As a note, this works for this particular use case. I can try to work out a more general approach, but this will at least get you off the ground.
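For what it's worth, a somewhat more general version might look like the sketch below. It assumes the broken value always sits between the "text" key and the "created" key that directly follows it, as in your sample. The function name repair_text_values and its key and next_key parameters are made up for illustration, and it will not cope with a comment that itself contains the text ","created": or that ends in a backslash.

```python
import json
import re


def repair_text_values(raw, key="text", next_key="created"):
    """Escape stray double quotes in each `key` value that is followed by `next_key`.

    Hypothetical helper: assumes `next_key` always comes right after `key`.
    """
    # Lazily capture everything between "key":" and the ","next_key": that closes it.
    pattern = re.compile(
        r'("%s"\s*:\s*")(.*?)("\s*,\s*"%s"\s*:)' % (re.escape(key), re.escape(next_key)),
        re.DOTALL,
    )

    def fix(match):
        head, value, tail = match.groups()
        # Escape only the quotes that are not already preceded by a backslash.
        value = re.sub(r'(?<!\\)"', r'\\"', value)
        return head + value + tail

    return pattern.sub(fix, raw)


raw = '{"test":[{"id":"1234","user":{"id":"1234"},"text":"test, "." test " 1234"","created":"2019-01-09"}]}'
fixed = repair_text_values(raw)
data = json.loads(fixed)
print(data["test"][0]["text"])
# test, "." test " 1234"
```

Because the match is anchored on the following key rather than on line breaks, this handles both the one-line and the pretty-printed input, and running it on JSON that is already valid leaves the string unchanged.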
Upvotes: 1