tomka
tomka

Reputation: 1315

Python: data structure to index sets to find supersets?

In my Python project I have a pool of objects, each tagged with a set of words. I want to generate all sets including subsets of tags, mapped to the linked objects. Those subsets should not be smaller than a complete tag set of any project. For instance, let's say there are these objects with their tags:

apple: fruit, green, nature
sometree: tree, green, wood, nature
banana: fruit, nature
someplant: green, wood, nature
otherplant: green, wood, nature

The result should be:

(fruit, nature): banana, apple
(fruit, green, nature): apple
(green, wood, nature): sometree, someplant, otherplant
(green, wood, nature, tree): sometree

I don't want the result to include tag sets that don't exists as a complete tag set for at least one objects.

To achieve this, I came up with a O(n²) algorithm, but wonder if there are more clever approaches like e.g. some tree-based index or a prefix-trie? I won't have too much objects, maybe 2000. However, it should be fast.

My current solution iterates all objects to create a dictionary mapping all tag sets to objects. In a second iteration I test each tag set if it is a subset of any other tag sets and if so, remember the objects of the supersets.

Update: In O(n²) abovel, n referrs to the number of objects. I will have many objects where each has about five tags.

Solution: Thanks for all the replies. I ended up using the method of azorius, because it is both fast and robust. This is a complete example to list all groups:

tagged = { 
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}   

tag_sets = set()
tags_to_objects = {}

for obj, tags in tagged.items():
    for tag in tags:
        try:
            tags_to_objects[tag].add(obj)
        except KeyError:
            tags_to_objects[tag] = set([obj])
    # Record all tag sets
    tag_set = frozenset(tags)
    tag_sets.add(tag_set)

groups = {}
for tag_set in tag_sets:
    objects = None
    for tag in tag_set:
        if objects:
            objects = objects.intersection(tags_to_objects[tag])
        else:
            objects = tags_to_objects[tag]
    groups[tag_set] = objects

for k,v in groups.items():
    print '(%s): %s' % (', '.join(k), ', '.join(v))

Since there where several solutions, I did some timing tests on 1035 objects with about five tags each:

The code of pillmuncher is in my opinion the most elegant.

Upvotes: 3

Views: 1678

Answers (4)

jcr
jcr

Reputation: 1015

why not use sets?, according to http://wiki.python.org/moin/TimeComplexity sets have the following average time complexity O(min(len(s), len(t)), aka O(n)!

fruit = set(("apple", "banana"))
green = set(("apple", "sometree", "someplant", "otherplant"))
nature = set(("apple", "banana", "sometree", "someplant", "otherplant"))
tree = set (("sometree",))
wood = set (("sometree", "someplant", "otherplant"))

In [90]: fruit.intersection(nature)
Out[90]: set(['apple', 'banana'])

In [91]: fruit.intersection(nature).intersection(green)
Out[91]: set(['apple'])

In [92]: green.intersection(wood).intersection(nature)
Out[92]: set(['sometree', 'someplant', 'otherplant'])

In [93]: green.intersection(wood).intersection(nature).intersection(tree)
Out[93]: set(['sometree'])

looking back at this code, a much more elegant answer would be to use reduce:

In [90]: reduce(set.intersection, [fruit, nature])
Out[90]: set(['apple', 'banana'])

In [91]: reduce(set.intersection, [fruit, nature, green])
Out[91]: set(['apple'])

In [92]: reduce(set.intersection, [green, wood, nature])
Out[92]: set(['sometree', 'someplant', 'otherplant'])

In [93]: reduce(set.intersection, [green, wood, nature, tree])
Out[93]: set(['sometree'])

Upvotes: 1

pillmuncher
pillmuncher

Reputation: 10162

The Python Standard Library is your friend:

from collections import defaultdict
from itertools import combinations

tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}

tag_sets = defaultdict(set)

for each, tags in tagged.items():
    tag_sets[frozenset(tags)].add(each)

for a, b in combinations(tag_sets, 2):
    if a < b:
        tag_sets[a].update(tag_sets[b])
    elif b < a:
        tag_sets[b].update(tag_sets[a])

for k, v in tag_sets.items():
    print '(%s): %s' % (', '.join(k), ', '.join(v))

Result:

(wood, green, nature): sometree, someplant, otherplant
(fruit, green, nature): apple
(fruit, nature): banana, apple
(green, wood, tree, nature): sometree

How it works: First, we transpose the mapping item -> tag set to tag set -> set of items associated with said tag set. Then we iterate over all distinct 2-sets of tag sets, represented by pairs (a, b) and check if either a or b is a proper subset of the other. If it is, we join the mapped items of the superset with the mapped items of the proper subset.

[EDIT]

Here is another solution:

from collections import defaultdict

tagged = {
    'apple': ['fruit', 'green', 'nature'],
    'sometree': ['tree', 'green', 'wood', 'nature'],
    'banana': ['fruit', 'nature'],
    'someplant': ['green', 'wood', 'nature'],
    'otherplant': ['green', 'wood', 'nature']
}

intersection = set(tagged).intersection
tag_sets = set()
items = defaultdict(set)

for item, tags in tagged.items():
    tag_sets.add(frozenset(tags))
    for tag in tags:
        items[tag].add(item)

for tag_set in tag_sets:
    result = intersection(*(items[tag] for tag in tag_set))
    print '(%s): %s' % (', '.join(tag_set), ', '.join(result))

I don't know if it is any faster, since i don't have enough data to do a meaningful comparison.

[/EDIT]

Upvotes: 1

Bruce
Bruce

Reputation: 7132

Well, not sure if it's better but I had fun writing this tree.

It uses dictionnary for nodes (should use something better as I use the tag "data" which is unclean)

Root is root of tree, it gets filled by parsing initial data

About complexity, it's unclear what is n :

  • if you have lots of project (x) with few tags per project (y) then you want a good complexity on x and you don't care about y
  • if you have a few projects with lots of tags, then you want the contrary

I suppose first solution is mote common.

I I'm not mistaken, my solution runs in x.y.ln(y) (I need to sort tags per project) to create the tree.

Displaying the tree is y.y (but uses recursion and a temporary list so it must bloat memory).

d= {
"apple": ["fruit", "green", "nature"],
"sometree": ["tree", "green", "wood", "nature"],
"banana": ["fruit", "nature"],
"someplant": ["green", "wood", "nature"],
"otherplant": ["green", "wood", "nature"]
}


root=dict()
root['data']=[]
for k,v in d.iteritems():
   v.sort() # 
   r=root
   for c in v:
       try:
           r=r[c]
       except KeyError:
           r[c]=dict()
           r=r[c]
           r['data']=[]
   r['data']+=[k]

def dump(r,hist):
    result=r['data'][:]
    for x in r.keys():
        if x=='data':
            continue
        result+=dump(r[x],hist[:]+[x])
    if len(result)>0 and len(r['data'])>0:
        print (hist,result)
    return result

dump(root,[])

Code is far from perfect (quick and dirty mode) but it looks like it works:

$ c:\Python27\python.exe temp.py
(['green', 'nature', 'tree', 'wood'], ['sometree'])
(['green', 'nature', 'wood'], ['otherplant', 'someplant'])
(['fruit', 'green', 'nature'], ['apple'])
(['fruit', 'nature'], ['banana'])

Upvotes: 1

rnbguy
rnbguy

Reputation: 1399

May be you can use bipartite graph. Let O and T forms a bipartite graph. O are the objects and T are the tags. An object and a tag has edge means, that object has that tag.

You have to maintain adjacency lists for T and adjacency matrix for O. and hash table for the tags vertices and their degrees.

Now input is some tags. Check their degrees and find out which has smallest degree.

Now get the neighbours of it from adjacency lists and iterate over it. Check if each of them has an edge has between to other tags. use adjacency matrix for it. If it has edges to each of the other tags, store it. Else if has not any edge to any of the other tags, discard it.

If t is the number of tags, and o is the number of objects. Building up adjacency list, matrix and degree-array will take O(t*o) time and O(t*o) space.

After that, at each query of t' tags, it will take O(t'*o) time.

Hope this helps.

Upvotes: 1

Related Questions