Reputation: 6561
I have a string that can be one of two forms:
name multi word description {...}
or
name multi word description [...]
where {...} and [...] are any valid JSON. I am interested in parsing out just the JSON part of the string, but I'm not sure of the best way to do it (especially since I don't know which of the two forms the string will take). This is my current method:
import json
import sys

string = 'bob1: The ceo of the company {"salary": 100000}'
o_ind = string.find('{')
a_ind = string.find('[')
if o_ind == -1 and a_ind == -1:
    print("Could not find JSON")
    sys.exit(0)
# Use whichever bracket comes first; if one is missing (-1), use the other.
index = min(o_ind, a_ind)
if index == -1:
    index = max(o_ind, a_ind)
data = json.loads(string[index:])  # not named "json", to avoid shadowing the module
print(data)
It works, but I can't help feeling it could be done better. I considered a regex, but I had trouble with it matching nested objects and arrays instead of the outermost JSON object or array. Any suggestions?
Upvotes: 5
Views: 22358
Reputation: 4837
You can use a simple | (alternation) in a regex to match either of the two forms:
import re
import json
def json_from_s(s):
    match = re.findall(r"{.+[:,].+}|\[.+[,:].+\]", s)
    return json.loads(match[0]) if match else None
And some tests:
print(json_from_s('bob1: The ceo of the company {"salary": 100000}'))
print(json_from_s('bob1: The ceo of the company ["salary", 100000]'))
print(json_from_s('bob1'))
print(json_from_s('{1:}'))
print(json_from_s('[,1]'))
Output:
{'salary': 100000}
['salary', 100000]
None
None
None
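Note that the pattern above requires a : or , between the brackets, so valid JSON such as {} or [1] also returns None. A minimal sketch of a more permissive variant follows, assuming it is acceptable to let json.loads decide validity; the name json_from_s_relaxed is hypothetical and not part of the original answer. The regex only locates a candidate span, and any parse failure is turned into None.

```python
import json
import re

# Hypothetical variant of json_from_s: the pattern only finds a candidate
# span (first opening bracket through the last matching closing bracket),
# and json.loads decides whether that span is valid JSON.
_CANDIDATE = re.compile(r"\{.*\}|\[.*\]", re.DOTALL)


def json_from_s_relaxed(s):
    match = _CANDIDATE.search(s)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
```

This still assumes the description itself contains no { or [ before the JSON starts; if it does, the candidate span begins too early and the function returns None.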
Upvotes: 4
Reputation: 473863
You can locate the start of the JSON by checking for the presence of { or [ and then save everything up to the end of the string into a capturing group:
>>> import re
>>> string1 = 'bob1: The ceo of the company {"salary": 100000}'
>>> string2 = 'bob1: The ceo of the company ["10001", "10002"]'
>>>
>>> re.search(r"\s([{\[].*?[}\]])$", string1).group(1)
'{"salary": 100000}'
>>> re.search(r"\s([{\[].*?[}\]])$", string2).group(1)
'["10001", "10002"]'
Here the \s([{\[].*?[}\]])$ pattern breaks down to:
\s - a single whitespace character
( ... ) - the capturing group that holds the JSON text
[{\[] - matches a single { or [ (the latter needs to be escaped with a backslash)
.*? - a non-greedy match for any characters any number of times
[}\]] - matches a single } or ] (the latter needs to be escaped with a backslash)
$ - the end of the string
Or, you may use re.split() to split the string at a whitespace character followed by a { or [ (with a positive lookahead) and take the last item. It works for the sample input you've provided, but I'm not sure it is reliable in general:
>>> re.split(r"\s(?=[{\[])", string1)[-1]
'{"salary": 100000}'
>>> re.split(r"\s(?=[{\[])", string2)[-1]
'["10001", "10002"]'
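If the description itself may contain brackets, the regex approaches above can pick the wrong starting point. One possible alternative, swapped in here as a different technique rather than a refinement of the regex, is to try json.JSONDecoder.raw_decode at each { or [ and keep the first position that parses. This is only a sketch: the helper name extract_json is hypothetical, and unlike the patterns above it does not require the JSON to end the string.

```python
import json

_DECODER = json.JSONDecoder()


def extract_json(s):
    """Return the first JSON object or array embedded in s, or None."""
    for i, ch in enumerate(s):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i and
                # returns (value, end_index); trailing text is tolerated.
                obj, _end = _DECODER.raw_decode(s, i)
            except ValueError:
                continue  # not valid JSON from this position; keep scanning
            return obj
    return None
```

For example, extract_json('see [note] here {"a": [1, 2]}') skips the unparseable [note] and returns the object that follows it.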
Upvotes: 10