Reputation: 356
I'm trying to find a regular expression which matches repeated keys on different levels of a nested JSON string representation. All my "solutions" suffer from catastrophic backtracking so far.
An example of that JSON string looks like this:
d = {
    "a": {
        "b": {
            "c": {
                "d": "v1",
                "key": "v2"
            }
        },
        "c": {
            "g": "v3",
            "key": "v4"
        },
        "key": "v5"
    }
}
The value of key is the target. My application has all the object names leading to that key. With these names I can use a for loop to construct my final regex, so basically I need the parts to put in between.
Example:
If I get "a" and "key", I could construct the following: "a"[^}]*"key". This matches the first "key" in my string d, the one with value v2.
What should happen, though, is that "a" + "key" matches the key with value v5. The key with value v2 should be matched when the full path "a" + "b" + "c" + "key" comes in. The last case in this example would be matching the key with value v4 when "a" + "c" + "key" is given.
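To make the mismatch concrete, here is a minimal Python sketch of the naive approach. The helper name naive_pattern and the simplified value pattern (string values only) are assumptions made for this illustration. It joins the quoted names with [^}]* and shows that both ("a", "key") and ("a", "c", "key") end up at v2:

```python
import re

# Copy of the example string from above.
text = '''{
    "a": {
        "b": {
            "c": {
                "d": "v1",
                "key": "v2"
            }
        },
        "c": {
            "g": "v3",
            "key": "v4"
        },
        "key": "v5"
    }
}'''


def naive_pattern(path, connector=r'[^}]*'):
    """Join the quoted names with `connector` and capture a string value.

    Hypothetical helper for illustration only; it handles string values,
    not lists or numbers.
    """
    names = [re.escape('"%s"' % name) for name in path]
    return re.compile(connector.join(names) + r':\s*"([^"]*)"')


# Wanted: v5, but [^}]* cannot tell nesting levels apart.
print(naive_pattern(("a", "key")).search(text).group(1))       # v2
# Wanted: v4, but the first "c" found is the one nested under "b".
print(naive_pattern(("a", "c", "key")).search(text).group(1))  # v2
```

The connector stops at the first closing brace, so it can never skip over a complete nested object. That is the part a working connector would have to get right.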
So a complete regex for the last one would look similar to this:
"a"MATCH_EVERYTHING_IN_BETWEEN_REGEX"c"MATCH_EVERYTHING_IN_BETWEEN_REGEX"key":\s*(\[[^}]*?\]|".*?"|\d+\.*\d*)
To be clear, I am looking for this MATCH_EVERYTHING_IN_BETWEEN_REGEX expression, which I can plug in as a connector between the names. This is to make sure it matches only the key I have received the path for. The JSON string could be nested to an arbitrary depth.
Here is an online regex tester with the example: https://regex101.com/r/yNZ3wo/2
Note:
I know this is not Python-specific, but I'm also grateful for Python hints in this context. I thought about building my own parser, using a stack and counting { and }, but first I would like to make sure there is no easy regex solution.
EDIT: I know about the json library but this doesn't solve my case since I'm tracking the coordinates of my targets within the string representation inside an editor window. I'm not looking for the values themselves, I can access them from an associated dictionary.
Upvotes: 3
Views: 5502
Reputation: 356
Thanks to the answer provided by wp78de I realized that regex in this case is not the right tool for the job, at least not the one I wanted. Maybe this is of use for someone else, that's why I'm adding this here.
So, I wrote a function which solves the problem recursively.
I made use of the fact that I know which key has to be matched at which level, so the key index (ind) is only incremented when a line matches both by name and by level. Keys that match by name but sit on a different level are simply skipped. The if clauses at the end keep track of the nesting level.
As a first step I convert the string into a list of lines (with preceding blanks stripped):
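import json

d = {
    "a": {
        "b": {
            "c": {
                "d": "v1",
                "key": "v2"  # line 6
            }
        },
        "x": {
            "c": {
                "d": "v11",
                "key": "v20"  # line 12
            }
        },
        "c": {
            "g": "v3",
            "key": "v4"  # line 17
        },
        "key": "v5"  # line 19
    }
}

# Pretty-print so that every key sits on its own line, then strip the
# indentation so that each line can be tested with startswith().
ds = json.dumps(d, indent=4)
l = ds.split('\n')
ll = [x.lstrip() for x in l]


def findkey(l, t, lev=0, ind=0):
    """Return 1 + the number of lines consumed up to and including the line
    that matches the last key in t. Called with ll[1:], this equals the
    1-based line number in ds. Raises IndexError if the path is not found.

    l   -- list of stripped lines, starting after the opening "{"
    t   -- path of quoted key names, e.g. ('"a"', '"c"', '"key"')
    lev -- current nesting level
    ind -- index of the next key in t that still has to be matched
    """
    if ind == len(t):
        return 1
    else:
        el = l[0]
        try:
            # Advance only when the key matches by name and sits on the expected level.
            # t.index() returns the first occurrence, so the names in t are assumed unique.
            if el.startswith(t[ind]) and t.index(t[ind]) == lev:
                ind += 1
        except IndexError:
            pass
        # Keep track of the nesting level.
        if "{" in el:
            lev += 1
        if "}" in el:
            lev -= 1
        return 1 + findkey(l[1:], t, lev, ind)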
The above only returns the line number but now I can match my target with a very simple regex:
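import re
# Path of quoted key names; this value is an assumption chosen to be consistent with the output shown below.
tup = ('"a"', '"b"', '"c"', '"key"')
idx = findkey(ll[1:], tup) - 1  # convert the 1-based line number to a list index
s = re.compile(tup[-1] + r': (\s*(\[[^}]*?\]|".*?"|\d+\.*\d*))', re.DOTALL)
match = s.search(l[idx])
print("Value found at start index: {}, stop index: {}".format(match.start(1), match.end(2)))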
Output:
Value found at start index: 19, stop index: 23
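As a side note, the same name-plus-level idea can also be sketched without recursion and directly on character offsets of the whole string, which is closer to what an editor window needs. The following is only a sketch under assumptions: find_value_span is a made-up name, the input is assumed to be valid JSON whose root is an object, and keys of objects inside arrays are not addressable. It tokenizes the text so that braces inside string literals are ignored, keeps a stack with the keys of the currently open containers, and lets json.JSONDecoder.raw_decode find where the matched value ends:

```python
import json
import re

# A string literal (so braces inside strings are ignored) or one structural character.
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[{}\[\]:]')
_DECODER = json.JSONDecoder()


def find_value_span(text, path):
    """Return (start, end) offsets of the value stored under `path`, or None."""
    path = list(path)
    stack = []       # for every open container: the key it is stored under (None for the root)
    last_str = None  # most recent string literal, i.e. a candidate key
    key = None       # key that precedes the value currently being read
    for m in _TOKEN.finditer(text):
        tok = m.group()
        if tok.startswith('"'):
            last_str = json.loads(tok)
        elif tok == ":":
            key = last_str
            # stack[0] belongs to the root object, so the named part starts at index 1.
            if stack[1:] + [key] == path:
                start = m.end()
                while text[start].isspace():
                    start += 1
                # raw_decode parses exactly one JSON value and reports where it ends.
                _, end = _DECODER.raw_decode(text, start)
                return start, end
        elif tok in ("{", "["):
            stack.append(key)
            key = None
        else:  # "}" or "]"
            stack.pop()
            key = None
    return None


d = {"a": {"b": {"c": {"d": "v1", "key": "v2"}},
           "c": {"g": "v3", "key": "v4"},
           "key": "v5"}}
text = json.dumps(d, indent=4)
for p in [("a", "key"), ("a", "b", "c", "key"), ("a", "c", "key")]:
    start, end = find_value_span(text, p)
    print(p, start, end, text[start:end])
```

Unlike the line-based version, this takes unquoted names in the path, and the offsets refer to the complete string rather than to a single line.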
Upvotes: 0
Reputation: 18950
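This is hard. A possible solution is a two-step approach with recursive patterns. Step 1: match the balanced {...} block that follows "a": with
(?<="a": )({(?>[^{}]|(?1))*})
Step 2: search inside that block with a second pattern, which skips nested {...} blocks and captures the value of "key" on the current level:
({(?>[^{}]|(?1))*})|"key":\s*"([^"]*?)"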
Code sample:
import regex as re  # third-party "regex" package; the built-in re module has no recursion

test_str = ("{ \n"
            " \"a\": { \n"
            " \"b\": { \n"
            " \"c\": { \n"
            " \"d\": \"v1\", \n"
            " \"key\": \"v2\" \n"
            " } \n"
            " }, \n"
            " \"c\": { \n"
            " \"g\": \"v3\", \n"
            " \"key\": \"v4\" \n"
            " }, \n"
            " \"key\": \"v5\" \n"
            " } \n"
            "} \n")

# Step 1: the balanced {...} block that follows "a":
regex = r"(?<=\"a\": )({(?>[^{}]|(?1))*})"
# Step 2: skip nested {...} blocks (group 1) or capture the value of "key" (group 2)
innerRegex = r"({(?>[^{}]|(?1))*})|\"key\":\s*\"([^\"]*?)\""

matches = re.finditer(regex, test_str, re.DOTALL)
for match in matches:
    inner = match.group()[1:-1]  # strip the enclosing braces
    innerMatches = re.finditer(innerRegex, inner, re.DOTALL)
    for innerMatch in innerMatches:
        # Only report hits of the "key" alternative; the offsets are relative to inner.
        if innerMatch.group(2) is not None:
            print("Found at {start}-{end}: {group}".format(
                start=innerMatch.start(2), end=innerMatch.end(2), group=innerMatch.group(2)))
Alternatively, continue the search on the next level (not shown in the code above).
Basically, you would start again from step 1 on the inner match in the same way (see demo), e.g.:
(?<="c": )({(?>[^{}]|(?1))*})
This should give you a head start.
*Since this uses regex recursion, the alternative Python regex package is needed; the built-in re module does not support recursive patterns.
Upvotes: 1