Gillespie

Reputation: 6561

Parse valid JSON object or array from a string

I have a string that can be one of two forms:

name multi word description {...}

or

name multi word description [...]

where {...} and [...] are any valid JSON. I am interested in parsing out just the JSON part of the string, but I'm not sure of the best way to do it (especially since I don't know which of the two forms the string will be). This is my current method:

import json

string = 'bob1: The ceo of the company {"salary": 100000}'
o_ind = string.find('{')
a_ind = string.find('[')

if o_ind == -1 and a_ind == -1:
    print("Could not find JSON")
    exit(0)

# Start at whichever bracket comes first; if only one is present, use that one.
index = min(o_ind, a_ind)
if index == -1:
    index = max(o_ind, a_ind)

# Named "data" rather than "json" so the json module is not shadowed.
data = json.loads(string[index:])
print(data)

It works, but I can't help feeling it could be done better. I considered a regex, but I had trouble because it matched sub-objects and sub-arrays rather than the outermost JSON object or array. Any suggestions?

Upvotes: 5

Views: 22358

Answers (2)

midori

Reputation: 4837

You can use a simple alternation (|) in a regex to match either of the two forms:

import re
import json

def json_from_s(s):
    match = re.findall(r"{.+[:,].+}|\[.+[,:].+\]", s)
    return json.loads(match[0]) if match else None

And some tests:

print(json_from_s('bob1: The ceo of the company {"salary": 100000}'))
print(json_from_s('bob1: The ceo of the company ["salary", 100000]'))
print(json_from_s('bob1'))
print(json_from_s('{1:}'))
print(json_from_s('[,1]'))

Output (Python 3):

{'salary': 100000}
['salary', 100000]
None
None
None
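
One thing to keep in mind: the pattern only checks the rough shape of the text, so json.loads can still raise if the matched substring is not valid JSON. The pattern also requires a : or , with at least one character on each side, so values such as [1] or {} are never matched. A minimal sketch of a guarded variant follows; the name safe_json_from_s is hypothetical, and returning None on a parse failure is an assumption about the desired behaviour.

```python
import json
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r"{.+[:,].+}|\[.+[,:].+\]")


def safe_json_from_s(s):
    """Return the parsed JSON found in s, or None if nothing matches or parses."""
    match = PATTERN.search(s)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
```

With this variant, a string such as 'a {x: 1} b' yields None instead of an exception.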

Upvotes: 4

alecxe

Reputation: 473863

You can locate the start of the JSON by checking for the presence of { or [ and then capture everything from there to the end of the string in a capturing group:

>>> import re
>>> string1 = 'bob1: The ceo of the company {"salary": 100000}'
>>> string2 = 'bob1: The ceo of the company ["10001", "10002"]'
>>> 
>>> re.search(r"\s([{\[].*?[}\]])$", string1).group(1)
'{"salary": 100000}'
>>> re.search(r"\s([{\[].*?[}\]])$", string2).group(1)
'["10001", "10002"]'

Here, \s([{\[].*?[}\]])$ breaks down as follows:

  • \s - a single whitespace character
  • the parentheses ( ... ) form a capturing group
  • [{\[] would match a single { or [ (the latter needs to be escaped with a backslash)
  • .*? is a non-greedy match for any characters, any number of times
  • [}\]] would match a single } or ] (the latter needs to be escaped with a backslash)
  • $ means the end of the string
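
To go from the captured text to a Python object, the match can be passed to json.loads. The following is only a sketch: the names JSON_TAIL and parse_tail are hypothetical, and it assumes the input is a single line, since . does not match newlines without re.DOTALL.

```python
import json
import re

# Whitespace, then a captured span that starts with { or [
# and ends with } or ] at the end of the string.
JSON_TAIL = re.compile(r"\s([{\[].*?[}\]])$")


def parse_tail(s):
    """Parse the JSON at the end of s, or return None if the pattern does not match."""
    match = JSON_TAIL.search(s)
    if match is None:
        return None
    # json.loads still raises ValueError if the captured text is not valid JSON.
    return json.loads(match.group(1))
```

Because the search is leftmost, nested values are captured whole: the match starts at the first whitespace followed by { or [, and the $ anchor forces it to run to the end of the string.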

Alternatively, you can use re.split() to split the string on a whitespace character that is followed by { or [ (using a positive lookahead) and take the last item. It works for the sample input you've provided, but I'm not sure it is reliable in general:

>>> re.split(r"\s(?=[{\[])", string1)[-1]
'{"salary": 100000}'
>>> re.split(r"\s(?=[{\[])", string2)[-1]
'["10001", "10002"]'
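
If reliability in general is a concern, a different technique that avoids regex altogether is to let the standard library's JSON decoder decide where the JSON starts. The sketch below is not part of either regex approach above: it tries each { or [ as a candidate start and uses json.JSONDecoder.raw_decode, which parses a value beginning at a given index. The function name first_json_value is hypothetical.

```python
import json


def first_json_value(s):
    """Return the first substring of s that parses as a JSON object or array."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(s):
        if ch in "{[":
            try:
                # raw_decode returns (value, index where the value ended).
                value, _end = decoder.raw_decode(s, i)
            except ValueError:
                # Not valid JSON from this position; try the next candidate.
                continue
            return value
    return None
```

This tolerates brackets in the description, such as 'The ceo [sic] of the company', because a candidate that fails to parse is skipped. The caveat is that a description containing something that is itself valid JSON, for example [1], would be returned first.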

Upvotes: 10
