Reputation: 1897
I am trying to determine the fastest way to search a nested dictionary with a regular expression and to return the paths to each occurrence of that string. I am only interested in values that are strings, not other values that may not explicitly be strings. Recursion is not my strong suit. Here is an example JSON, let's say I'm looking for all absolute paths that contain 'blah'.
d = {'id': 'abcde',
'key1': 'blah',
'key2': 'blah blah',
'nestedlist': [{'id': 'qwerty',
'nestednestedlist': [{'id': 'xyz', 'keyA': 'blah blah blah'},
{'id': 'fghi', 'keyZ': 'blah blah blah'}],
'anothernestednestedlist': [{'id': 'asdf', 'keyQ': 'blah blah'},
{'id': 'yuiop', 'keyW': 'blah'}]}]}
I found the following code snippet, but am unsuccessful in making it return the paths, rather than just print them. Beyond that, it shouldn't be too hard to add an "IF value is a string AND contains re.search() THEN append the path to list".
def search_dict(v, prefix=''):
if isinstance(v, dict):
for k, v2 in v.items():
p2 = "{}['{}']".format(prefix, k)
search_dict(v2, p2)
elif isinstance(v, list):
for i, v2 in enumerate(v):
p2 = "{}[{}]".format(prefix, i)
search_dict(v2, p2)
else:
print('{} = {}'.format(prefix, repr(v)))
Upvotes: 2
Views: 682
Reputation: 135357
Both answers here eagerly compute the result, exhausting the entire input dict before returning the first (if any) available result. We can use yield from
to encode a more Pythonic program -
def search_substr(t = {}, q = ""):
def loop(t, path):
if isinstance(t, dict):
for k, v in t.items():
yield from loop(v, (*path, k)) # <- recur
elif isinstance(t, list):
for k, v in enumerate(t):
yield from loop(v, (*path, k)) # <- recur
elif isinstance(t, str):
if q in t:
yield path, t # <- output a match
yield from loop(t, ()) # <- init
for (path, value) in search_substr(d, "blah"):
print(path, value)
Result -
('key1',) blah
('key2',) blah blah
('nestedlist', 0, 'nestednestedlist', 0, 'keyA') blah blah blah
('nestedlist', 0, 'nestednestedlist', 1, 'keyZ') blah blah blah
('nestedlist', 0, 'anothernestednestedlist', 0, 'keyQ') blah blah
('nestedlist', 0, 'anothernestednestedlist', 1, 'keyW') blah
Note, we test for substring q
in target t
by using q in t
. If you would actually like to use regexp for this -
from re import compile
def search_re(t = {}, q = ""):
def loop(t, re, path): # <- add re
if isinstance(t, dict):
for k, v in t.items():
yield from loop(v, re, (*path, k)) # <- carry re
elif isinstance(t, list):
for k, v in enumerate(t):
yield from loop(v, re, (*path, k)) # <- carry re
elif isinstance(t, str):
if re.search(t): # <- re.search
yield path, t
yield from loop(t, compile(q), ()) # <- compile q
Now we can search using a regexp -
for (path, value) in search_re(d, r"[abhl]{4}"):
print(path, value)
Result -
('key1',) blah
('key2',) blah blah
('nestedlist', 0, 'nestednestedlist', 0, 'keyA') blah blah blah
('nestedlist', 0, 'nestednestedlist', 1, 'keyZ') blah blah blah
('nestedlist', 0, 'anothernestednestedlist', 0, 'keyQ') blah blah
('nestedlist', 0, 'anothernestednestedlist', 1, 'keyW') blah
Let's try another search using a different query -
for (path, value) in search_re(d, r"[dfs]{3}"):
print(path, value)
('nestedlist', 0, 'anothernestednestedlist', 0, 'id') asdf
Finally, search_substr
and search_re
yield nothing when the query does not match -
print(list(search_re(d, r"zzz")))
# []
Upvotes: 2
Reputation: 1897
Adam.Er8 is accurate, i just wanted to give a more explicit answer to the question:
def search_dict(v, re_term, prefix=''):
re_term = re.compile(re_term)
result = []
if isinstance(v, dict):
for k, v2 in v.items():
p2 = "{}['{}']".format(prefix, k)
result.extend(search_dict(v2, re_term, prefix = p2))
elif isinstance(v, list):
for i, v2 in enumerate(v):
p2 = "{}[{}]".format(prefix, i)
result.extend(search_dict(v2, re_term, prefix = p2))
elif isinstance(v, str) and re.search(re_term,v):
result.append(prefix)
return result
Upvotes: 0
Reputation: 13413
You simply need to initialize an output list, append
elements found in the current call to, and extend
it by results returned by recursive calls.
try this:
def search_dict(v, prefix=''):
result = []
if isinstance(v, dict):
for k, v2 in v.items():
p2 = "{}['{}']".format(prefix, k)
result.extend(search_dict(v2, p2))
elif isinstance(v, list):
for i, v2 in enumerate(v):
p2 = "{}[{}]".format(prefix, i)
result.extend(search_dict(v2, p2))
else:
result.append('{} = {}'.format(prefix, repr(v)))
return result
Upvotes: 1