Parse valid JSON object or array from a string

Question

I have a string that can be one of two forms:

name multi word description {...}

or

name multi word description [...]

where {...} and [...] are any valid JSON. I am interested in parsing out just the JSON part of the string, but I'm not sure of the best way to do it (especially since I don't know which of the two forms the string will be). This is my current method:

import json

string = 'bob1: The ceo of the company {"salary": 100000}' 
o_ind = string.find('{')
a_ind = string.find('[')

if o_ind == -1 and a_ind == -1:
    print("Could not find JSON")
    exit(0)

index = min(o_ind, a_ind)
if index == -1:
    index = max(o_ind, a_ind)

# Use a separate name so the json module is not shadowed
data = json.loads(string[index:])
print(data)

It works, but I can't help feeling it could be done better. I thought about using a regex, but I had trouble with it matching nested sub-objects and arrays instead of the outermost JSON object or array. Any suggestions?

alecxe · Accepted Answer

You can locate the start of the JSON by checking for the presence of { or [ and then capture everything from there to the end of the string in a capturing group:

>>> import re
>>> string1 = 'bob1: The ceo of the company {"salary": 100000}'
>>> string2 = 'bob1: The ceo of the company ["10001", "10002"]'
>>> 
>>> re.search(r"\s([{\[].*?[}\]])$", string1).group(1)
'{"salary": 100000}'
>>> re.search(r"\s([{\[].*?[}\]])$", string2).group(1)
'["10001", "10002"]'

Here the \s([{\[].*?[}\]])$ pattern breaks down to:

\s - a single whitespace character
( ... ) - the parentheses form a capturing group
[{\[] matches a single { or [ (the latter needs to be escaped with a backslash)
.*? is a non-greedy match for any characters any number of times (because of the $ anchor it still has to extend to the final closing bracket, so nested objects and arrays stay inside the match)
[}\]] matches a single } or ] (the latter needs to be escaped with a backslash)
$ means the end of the string
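To go from the captured text to an actual Python object, the match can be handed to json.loads(). The following is a minimal sketch, not a definitive implementation: the names JSON_TAIL and extract_json are hypothetical, and it assumes the JSON sits on a single line at the very end of the string.

```python
import json
import re

# Whitespace, then an opening { or [, then everything up to a closing
# } or ] at the very end of the string (same pattern as above).
JSON_TAIL = re.compile(r"\s([{\[].*?[}\]])$")


def extract_json(text):
    """Return the parsed JSON found at the end of text, or None.

    Hypothetical helper: returns None both when the pattern does not
    match and when the captured text is not valid JSON.
    """
    match = JSON_TAIL.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


print(extract_json('bob1: The ceo of the company {"salary": 100000}'))
print(extract_json('bob1: The ceo of the company ["10001", "10002"]'))
```

Note that the search starts at the first whitespace followed by { or [, so a description that itself contains a bracket after a space (for example "bob [ceo] {...}") makes the captured text invalid JSON, and this sketch returns None in that case.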

Or, you may use re.split() to split the string on a whitespace character followed by a { or [ (using a positive lookahead) and take the last item. It works for the sample input you've provided, but I'm not sure it is reliable in general:

>>> re.split(r"\s(?=[{\[])", string1)[-1]
'{"salary": 100000}'
>>> re.split(r"\s(?=[{\[])", string2)[-1]
'["10001", "10002"]'
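If reliability is a concern, one alternative (a different technique from the regex approaches above, not part of them) is to let the standard library's json.JSONDecoder.raw_decode() try each { or [ as a possible start and accept the first candidate that parses and runs to the end of the string. This is only a sketch under that assumption; the function name find_trailing_json is hypothetical.

```python
import json


def find_trailing_json(text):
    """Return the first JSON object/array that parses and ends the string.

    Hypothetical helper: tries every { or [ as a starting point and
    returns None if no candidate is valid JSON reaching the end of text.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode returns the parsed value and the index where it ended
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        # Only accept a value that is followed by nothing but whitespace
        if text[end:].strip() == "":
            return value
    return None


print(find_trailing_json('bob1: The ceo of the company {"salary": 100000}'))
print(find_trailing_json('bob [ceo] of the company ["10001", "10002"]'))
```

Because the JSON parser itself decides where the value starts and ends, brackets inside the description or nested inside the JSON do not need special handling, at the cost of possibly attempting several parses.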
