Reputation: 907
The following is an example JSON string that needs to be parsed:
'{
"name":"bla",
"quote":"bla bla "blah blah" bla",
"occupation":"blabla"
}'
I need to automatically insert \\ to escape the two inner quotes so that the string can be parsed. I followed this. The problem is that it splits the string by : because it assumes that the JSON string has only one key/value pair. Moreover, I also cannot split by , because the quote section can contain , in its text, for example "quote":"bla bla, "blah blah" bla". So, in contrast to that answer, I need a more robust solution. How can I do this? I cannot think of any modification of that answer that will work perfectly in my case.
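To make the problem concrete, here is a minimal sketch of what goes wrong and what the escaped target looks like. The names raw, escaped and can_parse are made up for illustration, and the string is written on one line for brevity.

```python
import json

# The string as received: the inner quotes around "blah blah" are not escaped.
raw = '{"name":"bla","quote":"bla bla "blah blah" bla","occupation":"blabla"}'

# The string we want to end up with: each inner quote is preceded by a backslash.
escaped = '{"name":"bla","quote":"bla bla \\"blah blah\\" bla","occupation":"blabla"}'


def can_parse(text):
    """Return True if the standard json module accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


raw_ok = can_parse(raw)          # the unescaped quotes end the value early
escaped_ok = can_parse(escaped)  # the escaped version is valid JSON
quote_text = json.loads(escaped)["quote"]
```

The goal is therefore some transformation that turns raw into escaped without knowing in advance where the inner quotes are.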
Upvotes: 1
Views: 374
Reputation: 7952
Given two (pretty big) premises, it can still be parsed:
1. {, } and : are not valid values for any of the fields (they never appear inside a value).
2. {} has 3 parts (exactly three key/value pairs).
Premise 2 could be removed if you generalized the parsing of key/value pairs to handle a varying number of them.
Premise 1 can also be slightly relaxed, if you can say that : can only appear in the value fields (in any quantity). This would be mutually exclusive with #2, however.
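One way the generalization to a varying number of pairs might look is sketched below. It is not the regex from this answer: it swaps in a per-pair pattern with a lookahead, and it still assumes that : and newlines never appear inside a value and that the text holds a single object. PAIR and parse_loose_object are hypothetical names.

```python
import json
import re

# One "key":"value" pair. The value ends at the first quote that is followed
# by either the next key (, "key":) or the closing brace.
PAIR = re.compile(
    r'"([^"]*)"\s*:\s*"'               # key, colon, opening quote of the value
    r'([^:\n\r]*?)'                    # value: lazy, no colons or newlines (assumption)
    r'"(?=\s*,\s*"[^"]*"\s*:|\s*\})'   # closing quote, then next key or }
)


def parse_loose_object(text):
    """Return a dict of the key/value pairs found in one malformed object."""
    return {key: value for key, value in PAIR.findall(text)}


broken = '''{
"name":"bla",
"quote":"bla bla, "blah blah" bla",
"occupation":"blabla"
}'''

fields = parse_loose_object(broken)
repaired = json.dumps(fields)  # valid JSON, inner quotes now escaped
```

Because the pattern matches one pair at a time, the number of fields no longer matters, and json.dumps produces a properly escaped string that json.loads accepts.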
The regex:
{\"([^\"]*)\":\"([^:\n\r]*)\",?\"([^\"]*)\":\"([^:\n\r]*)\",?\"([^\"]*)\":\"([^:\n\r]*)\",?}
or, as a raw string:
r'{"([^"]*)":"([^:\n\r]*)",?"([^"]*)":"([^:\n\r]*)",?"([^"]*)":"([^:\n\r]*)",?}'
NOTE: This doesn't handle any whitespace (including line breaks) between the JSON tokens. That can be added if necessary; the pattern is just pretty long already.
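A possible way to add the whitespace handling without making the pattern unreadable is to build it from smaller pieces. This is only a sketch under the same two premises; KEY, VALUE, PAIR and SPACED_PATTERN are names invented here.

```python
import re

KEY = r'"([^"]*)"'
VALUE = r'"([^:\n\r]*)"'           # same value class as the pattern above
PAIR = KEY + r'\s*:\s*' + VALUE    # allow whitespace around the colon

# Three pairs, separated by an optional comma with optional whitespace.
SPACED_PATTERN = re.compile(
    r'\{\s*' + r'\s*,?\s*'.join([PAIR] * 3) + r'\s*,?\s*\}'
)

text = '''{
"name":"bla",
"quote":"bla bla "blah blah" bla",
"occupation":"blabla"
}'''

match = SPACED_PATTERN.search(text)
groups = match.groups() if match else ()
# Groups alternate key, value, key, value, ...
result = dict(zip(groups[0::2], groups[1::2]))
```

With this variant the multi-line string from the question matches directly, without first stripping the line breaks.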
Usage:
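import re

pattern = r'{"([^"]*)":"([^:\n\r]*)",?"([^"]*)":"([^:\n\r]*)",?"([^"]*)":"([^:\n\r]*)",?}'
# input is the malformed JSON text (one or more objects, no whitespace between tokens)
matches = re.findall(pattern, input)
for match in matches:
    # Each match is a 6-tuple: key, value, key, value, key, value
    result = {match[0]: match[1], match[2]: match[3], match[4]: match[5]}
    # Do something with each result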
In use:
>>> pattern = '{\"([^\"]*)\":\"([^:\n\r]*)\",?\"([^\"]*)\":\"([^:\n\r]*)\",?\"([^\"]*)\":\"([^:\n\r]*)\",?}'
>>> matches = re.findall(pattern, input)
>>> for match in matches:
...     result = {match[0]: match[1], match[2]: match[3], match[4]: match[5]}
...
>>> result
{'quote': 'bla bla "blah blah" bla', 'name': 'bla', 'occupation': 'blabla'}
Another example:
>>> input = """{"name":"b"testst,s'''""'''''''t""e,"la","quote":"bla bla "blah b,lah" bla","occupation":"bl,,,abla"}"""
>>> matches = re.findall(pattern, input)
>>> for match in matches:
...     result = {match[0]: match[1], match[2]: match[3], match[4]: match[5]}
...
>>> result
{'quote': 'bla bla "blah b,lah" bla', 'name': 'b"testst,s\'\'\'""\'\'\'\'\'\'\'t""e,"la', 'occupation': 'bl,,,abla'}
Upvotes: 0
Reputation: 3525
This is definitely malformed JSON, and there is no robust way of parsing it that covers all the possible cases.
If you know that every line has this structure, you can try splitting in a more convoluted way, for example on ":", but this is not reliable: it breaks as soon as a value itself contains that sequence. An alternative would be to use a regex, but that is more complicated and may suffer from the same problems.
The best solution would be to go to the person who created this JSON, slap him in the face, and ask him to re-encode the file, but I imagine this is not possible at the moment.
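For what it is worth, the splitting idea could be sketched roughly as follows. It assumes there is no whitespace around the colons and that the three-character sequence ":" never occurs inside a value, which is exactly why it is not reliable in general. The function name split_on_key_separator is made up.

```python
def split_on_key_separator(text):
    """Split a malformed object on '":"' and rebuild the key/value pairs."""
    body = text.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1].strip()          # drop the outer braces

    parts = body.split('":"')
    key = parts[0][1:]                     # first key, minus its opening quote
    result = {}

    # Every middle part looks like: value",<whitespace>"next_key
    for part in parts[1:-1]:
        value_part, _, next_key = part.rpartition('"')
        value = value_part.rstrip().rstrip(",").rstrip()
        if value.endswith('"'):
            value = value[:-1]             # drop the value's closing quote
        result[key] = value
        key = next_key

    last = parts[-1].rstrip()
    result[key] = last[:-1] if last.endswith('"') else last
    return result


broken = '''{
"name":"bla",
"quote":"bla bla, "blah blah" bla",
"occupation":"blabla"
}'''

fields = split_on_key_separator(broken)
```

A value such as he said ":" loudly would be cut in the wrong place, so this only helps when you control or trust the shape of the data.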
Upvotes: 2