negfrequency
negfrequency

Reputation: 1897

Recursively search for paths in a dictionary containing a sub-string

I am trying to determine the fastest way to search a nested dictionary with a regular expression and to return the paths to each occurrence of that string. I am only interested in values that are strings, not other values that may not explicitly be strings. Recursion is not my strong suit. Here is an example JSON, let's say I'm looking for all absolute paths that contain 'blah'.

d = {'id': 'abcde',
 'key1': 'blah',
 'key2': 'blah blah',
 'nestedlist': [{'id': 'qwerty',
   'nestednestedlist': [{'id': 'xyz', 'keyA': 'blah blah blah'},
    {'id': 'fghi', 'keyZ': 'blah blah blah'}],
   'anothernestednestedlist': [{'id': 'asdf', 'keyQ': 'blah blah'},
    {'id': 'yuiop', 'keyW': 'blah'}]}]}

I found the following code snippet, but am unsuccessful in making it return the paths, rather than just print them. Beyond that, it shouldn't be too hard to add an "IF value is a string AND contains re.search() THEN append the path to list".

def search_dict(v, prefix=''):
    
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            search_dict(v2, p2)
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            search_dict(v2, p2)
    else:
        print('{} = {}'.format(prefix, repr(v)))

Upvotes: 2

Views: 682

Answers (3)

Mulan
Mulan

Reputation: 135357

Both answers here eagerly compute the result, exhausting the entire input dict before returning the first (if any) available result. We can use yield from to encode a more Pythonic program -

def search_substr(t = {}, q = ""):
  def loop(t, path):
    if isinstance(t, dict):
      for k, v in t.items():
        yield from loop(v, (*path, k))  # <- recur
    elif isinstance(t, list):
      for k, v in enumerate(t):
        yield from loop(v, (*path, k))  # <- recur
    elif isinstance(t, str):
      if q in t:
        yield path, t                   # <- output a match
  yield from loop(t, ())                # <- init

for (path, value) in search_substr(d, "blah"):
  print(path, value)

Result -

('key1',) blah
('key2',) blah blah
('nestedlist', 0, 'nestednestedlist', 0, 'keyA') blah blah blah
('nestedlist', 0, 'nestednestedlist', 1, 'keyZ') blah blah blah
('nestedlist', 0, 'anothernestednestedlist', 0, 'keyQ') blah blah
('nestedlist', 0, 'anothernestednestedlist', 1, 'keyW') blah

Note, we test for substring q in target t by using q in t. If you would actually like to use regexp for this -

from re import compile

def search_re(t = {}, q = ""):
  def loop(t, re, path):                      # <- add re
    if isinstance(t, dict):
      for k, v in t.items():
        yield from loop(v, re, (*path, k))    # <- carry re
    elif isinstance(t, list):
      for k, v in enumerate(t):
        yield from loop(v, re, (*path, k))    # <- carry re
    elif isinstance(t, str):
      if re.search(t):                        # <- re.search
        yield path, t
  yield from loop(t, compile(q), ())          # <- compile q

Now we can search using a regexp -

for (path, value) in search_re(d, r"[abhl]{4}"):
  print(path, value)

Result -

('key1',) blah
('key2',) blah blah
('nestedlist', 0, 'nestednestedlist', 0, 'keyA') blah blah blah
('nestedlist', 0, 'nestednestedlist', 1, 'keyZ') blah blah blah
('nestedlist', 0, 'anothernestednestedlist', 0, 'keyQ') blah blah
('nestedlist', 0, 'anothernestednestedlist', 1, 'keyW') blah

Let's try another search using a different query -

for (path, value) in search_re(d, r"[dfs]{3}"):
  print(path, value)
('nestedlist', 0, 'anothernestednestedlist', 0, 'id') asdf

Finally, search_substr and search_re yield nothing when the query does not match -

print(list(search_re(d, r"zzz")))
# []

Upvotes: 2

negfrequency
negfrequency

Reputation: 1897

Adam.Er8 is accurate, i just wanted to give a more explicit answer to the question:

def search_dict(v, re_term, prefix=''):

    re_term = re.compile(re_term)
    result = []
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            result.extend(search_dict(v2, re_term, prefix = p2))
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            result.extend(search_dict(v2, re_term, prefix = p2))
    elif isinstance(v, str) and re.search(re_term,v):
        result.append(prefix)
    return result

Upvotes: 0

Adam.Er8
Adam.Er8

Reputation: 13413

You simply need to initialize an output list, append elements found in the current call to, and extend it by results returned by recursive calls.

try this:

def search_dict(v, prefix=''):
    result = []
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            result.extend(search_dict(v2, p2))
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            result.extend(search_dict(v2, p2))
    else:
        result.append('{} = {}'.format(prefix, repr(v)))
    return result

Upvotes: 1

Related Questions