Reputation: 132018
I have several hundred thousand endpoint URLs that I want to generate stats for. For example I have:
/a/b/c
/a/b/d
/a/c/d
/b/c/d
/b/d/e
/a/b/c
/b/c/d
I want to create a dictionary that looks like this
{
'a': {
'b': {
'c': {
'_count': 2
},
'd': {
'_count': 1
}
},
'c': {
'd': {
'_count': 1
}
}
},
'b': {
'c': {
'd': {
'_count': 2
}
},
'd': {
'e': {
'_count': 1
}
}
}
}
Any clever ways to do this?
EDIT
I should mention that the paths are not always 3 parts. There might be
/a/b/c/d/e/f/g/h
... etc, etc.
Upvotes: 12
Views: 18898
Reputation: 9658
Based on the answers, I wrote a general function for setting a dictionary value along a path:
def dictPath(path, dictionary, val, sep="/"):
"set a value in a nested dictionary"
while path.startswith(sep):
path = path[1:]
parts = path.split(sep, 1)
if len(parts) > 1:
branch = dictionary.setdefault(parts[0], {})
dictPath(parts[1], branch, val, sep)
else:
dictionary[parts[0]] = val
Upvotes: 1
Reputation: 515
Old result, but still near the top in google, so I'll update: You could use dpath-python for this.
$ easy_install dpath
>>> result = {}
>>> for path in my_list_of_paths:
>>> ... dpath.util.set(result, path, SOME_VALUE)
... and that's it. I don't understand the math you're using to precompute those values on the terminus (1, 2, etc), but you could precalculate it and use a dictionary of path-to-value instead of a bare list
>>> x = {'path/name': 0, 'other/path/name': 1}
>>> for (path, value) in x.iteritems():
>>> ... dpath.util.set(result, path, value)
Something like that would work.
Upvotes: 5
Reputation: 11438
Here's my attempt:
class Result(object):
def __init__(self):
self.count = 0
self._sub_results = {}
def __getitem__(self, key):
if key not in self._sub_results:
self._sub_results[key] = Result()
return self._sub_results[key]
def __str__(self):
return "(%s, %s)" % (self.count, self._sub_results)
def __repr__(self):
return str(self)
def process_paths(paths):
path_result = Result()
for path in paths:
components = path[1:].split("/")
local_result = path_result
for component in components:
local_result = local_result[component]
local_result.count += 1
return path_result
I've wrapped up some of the logic into the Result
class to try and make the algorithm itself a little clearer.
Upvotes: 1
Reputation: 59653
EDIT:
I've amended my code to fit your last comment, above (no complex data structure now).
def dictizeString(string, dictionary):
while string.startswith('/'):
string = string[1:]
parts = string.split('/', 1)
if len(parts) > 1:
branch = dictionary.setdefault(parts[0], {})
dictizeString(parts[1], branch)
else:
if dictionary.has_key(parts[0]):
# If there's an addition error here, it's because invalid data was added
dictionary[parts[0]] += 1
else:
dictionary[parts[0]] = 1
It will store a list of [frequency, dictionary]
for each item.
Test case
>>> d = {}
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/e', d)
>>> dictizeString('/c', d)
>>> d
{'a': {'b': {'c': {'d': 4}, 'e': 1}}, 'c': 1}
Upvotes: 6
Reputation: 229603
If the paths all look like in your example, this would work:
counts = {}
for p in paths:
parts = p.split('/')
branch = counts
for part in parts[1:-1]:
branch = branch.setdefault(part, {})
branch[parts[-1]] = 1 + branch.get(parts[-1], 0)
This uses dictionary methods like setdefault()
and get()
to avoid having to write a lot of if-statements.
Note that this will not work if a path that has subdirectories can also appear on it's own. Then it's not clear if the corresponding part of counts
should contain a number or another dictionary. In this case it would probably be best to store both a count and a dict for each node, using a tuple or a custom class.
The basic algorithm stays the same:
class Stats(object):
def __init__(self):
self.count = 0
self.subdirs = {}
counts = Stats()
for p in paths:
parts = p.split('/')
branch = counts
for part in parts[1:]:
branch = branch.subdirs.setdefault(part, Stats())
branch.count += 1
With some pretty printing you get:
def printstats(stats, indent=''):
print indent + str(stats.count) + ' times'
for (d, s) in stats.subdirs.items():
print indent + d + ':'
printstats(s, indent + ' ')
>>> printstats(counts)
0 times
a:
0 times
c:
0 times
d:
1 times
b:
0 times
c:
2 times
d:
1 times
...
Upvotes: 13