Reputation: 1315
In my Python project I have a pool of objects, each tagged with a set of words. I want to generate all tag sets, including subsets, mapped to the linked objects. Those subsets should not be smaller than the complete tag set of any object. For instance, let's say there are these objects with their tags:
apple: fruit, green, nature
sometree: tree, green, wood, nature
banana: fruit, nature
someplant: green, wood, nature
otherplant: green, wood, nature
The result should be:
(fruit, nature): banana, apple
(fruit, green, nature): apple
(green, wood, nature): sometree, someplant, otherplant
(green, wood, nature, tree): sometree
I don't want the result to include tag sets that don't exist as the complete tag set of at least one object.
To achieve this, I came up with an O(n²) algorithm, but I wonder whether there are cleverer approaches, e.g. some tree-based index or a prefix trie. I won't have too many objects, maybe 2000. Still, it should be fast.
My current solution iterates all objects to create a dictionary mapping all tag sets to objects. In a second iteration I test each tag set if it is a subset of any other tag sets and if so, remember the objects of the supersets.
Update: In the O(n²) above, n refers to the number of objects. I will have many objects, each with about five tags.
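To make that concrete, here is a minimal sketch of the two-pass O(n²) method described above, in Python 3 (my reconstruction of the idea, not the original code):

```python
from collections import defaultdict

tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature'],
}

# First pass: map each complete tag set to the objects that carry it.
groups = defaultdict(set)
for obj, tags in tagged.items():
    groups[frozenset(tags)].add(obj)

# Second pass: every tag set that is a proper subset of another
# also collects the superset's objects -- this is the O(n^2) part.
result = {ts: set(objs) for ts, objs in groups.items()}
for a in groups:
    for b in groups:
        if a < b:
            result[a] |= groups[b]

assert result[frozenset(['fruit', 'nature'])] == {'apple', 'banana'}
```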
Solution: Thanks for all the replies. I ended up using the method of azorius, because it is both fast and robust. This is a complete example to list all groups:
tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}

tag_sets = set()
tags_to_objects = {}
for obj, tags in tagged.items():
    for tag in tags:
        try:
            tags_to_objects[tag].add(obj)
        except KeyError:
            tags_to_objects[tag] = set([obj])
    # Record all tag sets
    tag_set = frozenset(tags)
    tag_sets.add(tag_set)

groups = {}
for tag_set in tag_sets:
    objects = None
    for tag in tag_set:
        if objects is not None:
            objects = objects.intersection(tags_to_objects[tag])
        else:
            objects = tags_to_objects[tag]
    groups[tag_set] = objects

for k, v in groups.items():
    print '(%s): %s' % (', '.join(k), ', '.join(v))
Since there were several solutions, I did some timing tests on 1035 objects with about five tags each:
The code of pillmuncher is in my opinion the most elegant.
Upvotes: 3
Views: 1678
Reputation: 1015
Why not use sets? According to http://wiki.python.org/moin/TimeComplexity, set intersection has an average time complexity of O(min(len(s), len(t))), i.e. O(n)!
fruit = set(("apple", "banana"))
green = set(("apple", "sometree", "someplant", "otherplant"))
nature = set(("apple", "banana", "sometree", "someplant", "otherplant"))
tree = set(("sometree",))
wood = set(("sometree", "someplant", "otherplant"))
In [90]: fruit.intersection(nature)
Out[90]: set(['apple', 'banana'])
In [91]: fruit.intersection(nature).intersection(green)
Out[91]: set(['apple'])
In [92]: green.intersection(wood).intersection(nature)
Out[92]: set(['sometree', 'someplant', 'otherplant'])
In [93]: green.intersection(wood).intersection(nature).intersection(tree)
Out[93]: set(['sometree'])
looking back at this code, a much more elegant answer would be to use reduce:
In [90]: reduce(set.intersection, [fruit, nature])
Out[90]: set(['apple', 'banana'])
In [91]: reduce(set.intersection, [fruit, nature, green])
Out[91]: set(['apple'])
In [92]: reduce(set.intersection, [green, wood, nature])
Out[92]: set(['sometree', 'someplant', 'otherplant'])
In [93]: reduce(set.intersection, [green, wood, nature, tree])
Out[93]: set(['sometree'])
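Note that in Python 3 the built-in reduce has moved to functools, so the same idiom needs an import (a small sketch using the sets defined above):

```python
from functools import reduce

fruit = {"apple", "banana"}
green = {"apple", "sometree", "someplant", "otherplant"}
nature = {"apple", "banana", "sometree", "someplant", "otherplant"}

# reduce folds set.intersection pairwise over the list of sets.
print(reduce(set.intersection, [fruit, nature, green]))  # {'apple'}
```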
Upvotes: 1
Reputation: 10162
The Python Standard Library is your friend:
from collections import defaultdict
from itertools import combinations

tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}

tag_sets = defaultdict(set)
for each, tags in tagged.items():
    tag_sets[frozenset(tags)].add(each)

for a, b in combinations(tag_sets, 2):
    if a < b:
        tag_sets[a].update(tag_sets[b])
    elif b < a:
        tag_sets[b].update(tag_sets[a])

for k, v in tag_sets.items():
    print '(%s): %s' % (', '.join(k), ', '.join(v))
Result:
(wood, green, nature): sometree, someplant, otherplant
(fruit, green, nature): apple
(fruit, nature): banana, apple
(green, wood, tree, nature): sometree
How it works: First, we transpose the mapping item -> tag set into the mapping tag set -> set of items associated with said tag set. Then we iterate over all distinct pairs of tag sets, represented by (a, b), and check whether either a or b is a proper subset of the other. If so, we add the items mapped to the superset to the items mapped to the proper subset.
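The a < b test works because frozenset overloads the comparison operators as subset relations, e.g.:

```python
a = frozenset(['fruit', 'nature'])
b = frozenset(['fruit', 'green', 'nature'])

print(a < b)   # True: a is a proper subset of b
print(a <= a)  # True: <= is the (non-proper) subset test
print(a < a)   # False: no set is a proper subset of itself
```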
[EDIT]
Here is another solution:
from collections import defaultdict

tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}

intersection = set(tagged).intersection
tag_sets = set()
items = defaultdict(set)
for item, tags in tagged.items():
    tag_sets.add(frozenset(tags))
    for tag in tags:
        items[tag].add(item)

for tag_set in tag_sets:
    result = intersection(*(items[tag] for tag in tag_set))
    print '(%s): %s' % (', '.join(tag_set), ', '.join(result))
I don't know if it is any faster, since I don't have enough data to do a meaningful comparison.
[/EDIT]
Upvotes: 1
Reputation: 7132
Well, I'm not sure if it's better, but I had fun writing this tree.
It uses dictionaries for the nodes (it should use something better; I store the payload under the key "data", which is unclean).
root is the root of the tree; it gets filled by parsing the initial data.
About complexity, it's unclear what n is; I suppose the first interpretation (the number of objects) is the more common one.
If I'm not mistaken, my solution builds the tree in O(x·y·log(y)), with x objects of y tags each, since I need to sort the tags of each object.
Displaying the tree is O(y·y) (but it uses recursion and a temporary list, so it must bloat memory).
d = {
    "apple": ["fruit", "green", "nature"],
    "sometree": ["tree", "green", "wood", "nature"],
    "banana": ["fruit", "nature"],
    "someplant": ["green", "wood", "nature"],
    "otherplant": ["green", "wood", "nature"]
}

root = dict()
root['data'] = []
for k, v in d.iteritems():
    v.sort()
    r = root
    for c in v:
        try:
            r = r[c]
        except KeyError:
            r[c] = dict()
            r = r[c]
            r['data'] = []
    r['data'] += [k]

def dump(r, hist):
    result = r['data'][:]
    for x in r.keys():
        if x == 'data':
            continue
        result += dump(r[x], hist[:] + [x])
    if len(result) > 0 and len(r['data']) > 0:
        print (hist, result)
    return result

dump(root, [])
Code is far from perfect (quick and dirty mode) but it looks like it works:
$ c:\Python27\python.exe temp.py
(['green', 'nature', 'tree', 'wood'], ['sometree'])
(['green', 'nature', 'wood'], ['otherplant', 'someplant'])
(['fruit', 'green', 'nature'], ['apple'])
(['fruit', 'nature'], ['banana'])
Upvotes: 1
Reputation: 1399
Maybe you can use a bipartite graph. Let O and T form a bipartite graph, where O are the objects and T are the tags. An edge between an object and a tag means that the object has that tag.
You have to maintain adjacency lists for T, an adjacency matrix for O, and a hash table mapping the tag vertices to their degrees.
Now the input is some set of tags. Check their degrees and find the one with the smallest degree.
Get its neighbours from the adjacency lists and iterate over them. For each neighbour, check via the adjacency matrix whether it has an edge to every other input tag. If it has edges to all of the other tags, keep it; otherwise discard it.
If t is the number of tags and o is the number of objects, building the adjacency lists, the matrix, and the degree array takes O(t*o) time and O(t*o) space.
After that, each query of t' tags takes O(t'*o) time.
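A rough sketch of the query step described above, with plain dicts standing in for the adjacency lists and matrix (all names and the data layout here are my own assumptions, not the answer's):

```python
tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'banana': ['fruit', 'nature'],
}

# Adjacency lists for T: tag -> set of neighbouring objects.
# The set also gives the tag's degree via len().
adj = {}
for obj, tags in tagged.items():
    for tag in tags:
        adj.setdefault(tag, set()).add(obj)

def query(tags):
    # Start from the tag with the smallest degree...
    rarest = min(tags, key=lambda t: len(adj.get(t, ())))
    # ...and keep only its neighbours linked to every queried tag.
    return {o for o in adj.get(rarest, ())
            if all(o in adj.get(t, ()) for t in tags)}

print(sorted(query(['fruit', 'nature'])))  # ['apple', 'banana']
```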
Hope this helps.
Upvotes: 1