sberry
sberry

Reputation: 132018

Python: recursively create dictionary from paths

I have several hundred thousand endpoint URLs that I want to generate stats for. For example I have:

/a/b/c
/a/b/d
/a/c/d
/b/c/d
/b/d/e
/a/b/c
/b/c/d

I want to create a dictionary that looks like this

{
    'a': {
        'b': {
            'c': {
                '_count': 2
            },
            'd': {
                '_count': 1
            }
        },
        'c': {
            'd': {
                '_count': 1
            }
        }
    },
    'b': {
        'c': {
            'd': {
                '_count': 2
            }
        },
        'd': {
            'e': {
                '_count': 1
            }
        }
    }
}

Any clever ways to do this?

EDIT

I should mention that the paths are not always 3 parts. There might be /a/b/c/d/e/f/g/h... etc, etc.

Upvotes: 12

Views: 18898

Answers (5)

Ahmad
Ahmad

Reputation: 9658

Based on the answers, I wrote a general function for setting a dictionary value along a path:

def dictPath(path, dictionary, val, sep="/"):
    "set a value in a nested dictionary"
    while path.startswith(sep):
        path = path[1:]
    parts = path.split(sep, 1)
    if len(parts) > 1:
        branch = dictionary.setdefault(parts[0], {})
        dictPath(parts[1], branch, val, sep)
    else:
        dictionary[parts[0]] = val

Upvotes: 1

Andrew Kesterson
Andrew Kesterson

Reputation: 515

Old result, but still near the top in google, so I'll update: You could use dpath-python for this.

$ easy_install dpath
>>> result = {}
>>> for path in my_list_of_paths:
>>> ... dpath.util.set(result, path, SOME_VALUE)

... and that's it. I don't understand the math you're using to precompute those values on the terminus (1, 2, etc), but you could precalculate it and use a dictionary of path-to-value instead of a bare list

>>> x = {'path/name': 0, 'other/path/name': 1}
>>> for (path, value) in x.iteritems():
>>> ... dpath.util.set(result, path, value)

Something like that would work.

Upvotes: 5

Michael Williamson
Michael Williamson

Reputation: 11438

Here's my attempt:

class Result(object):
    def __init__(self):
        self.count = 0
        self._sub_results = {}

    def __getitem__(self, key):
        if key not in self._sub_results:
            self._sub_results[key] = Result()
        return self._sub_results[key]

    def __str__(self):
        return "(%s, %s)" % (self.count, self._sub_results)

    def __repr__(self):
        return str(self)

def process_paths(paths):
    path_result = Result()
    for path in paths:
        components = path[1:].split("/")
        local_result = path_result
        for component in components:
            local_result = local_result[component]
        local_result.count += 1
    return path_result

I've wrapped up some of the logic into the Result class to try and make the algorithm itself a little clearer.

Upvotes: 1

Smashery
Smashery

Reputation: 59653

EDIT:

I've amended my code to fit your last comment, above (no complex data structure now).

def dictizeString(string, dictionary):
    while string.startswith('/'):
        string = string[1:]
    parts = string.split('/', 1)
    if len(parts) > 1:
        branch = dictionary.setdefault(parts[0], {})
        dictizeString(parts[1], branch)
    else:
        if dictionary.has_key(parts[0]):
             # If there's an addition error here, it's because invalid data was added
             dictionary[parts[0]] += 1
        else:
             dictionary[parts[0]] = 1

It will store a list of [frequency, dictionary] for each item.

Test case

>>> d = {}
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/e', d)
>>> dictizeString('/c', d)
>>> d
{'a': {'b': {'c': {'d': 4}, 'e': 1}}, 'c': 1}

Upvotes: 6

sth
sth

Reputation: 229603

If the paths all look like in your example, this would work:

counts = {}
for p in paths:
   parts = p.split('/')
   branch = counts
   for part in parts[1:-1]:
      branch = branch.setdefault(part, {})
   branch[parts[-1]] = 1 + branch.get(parts[-1], 0)

This uses dictionary methods like setdefault() and get() to avoid having to write a lot of if-statements.

Note that this will not work if a path that has subdirectories can also appear on it's own. Then it's not clear if the corresponding part of counts should contain a number or another dictionary. In this case it would probably be best to store both a count and a dict for each node, using a tuple or a custom class.

The basic algorithm stays the same:

class Stats(object):
   def __init__(self):
      self.count = 0
      self.subdirs = {}

counts = Stats()
for p in paths:
   parts = p.split('/')
   branch = counts
   for part in parts[1:]:
      branch = branch.subdirs.setdefault(part, Stats())
   branch.count += 1

With some pretty printing you get:

def printstats(stats, indent=''):
   print indent + str(stats.count) + ' times'
   for (d, s) in stats.subdirs.items():
       print indent + d + ':'
       printstats(s, indent + '  ')

>>> printstats(counts)
0 times
a:
  0 times
  c:
    0 times
    d:
      1 times
  b:
    0 times
    c:
      2 times
    d:
      1 times
...

Upvotes: 13

Related Questions