Reputation: 123
I have a collection (dictionary) of sets of strings, like {1: {'a', 'b'}, ...}
I need to find the n largest intersections i.e. intersections of the largest subsets of the collection. The obvious brute-force approach:
for i in range(len(collection),2,-1):
for subset in combinations(sorted(collection), i):
intersected = set.intersection(*(collection[k] for k in subset))
if len(intersected)>0:
yield len(subset), intersected
is very slow. Is there some efficient way/library to do this?
Upvotes: 1
Views: 2182
Reputation: 11297
I assume you want to find the n largest subsets of keys of your dictionary for which the intersection of the corresponding sets is not empty. So, as in your example (in the comments to your post) the largest such subset is represented by the keys 1,2, and 4 and the intersection of the corresponding sets contains at least one element ('b' in our example).
Largest subset: Instead of generating all possible subsets of keys, you can just iterate over all the different set elements (a, b, etc.) and calculate the number of sets in which they occur. The element with the highest count leads you directly to the solution, the largest subset.
Example: In your example you would have obtained the following intermediate result:
a: 2, b: 3, c: 1, e: 1, f: 1
You immediately see that the element b is present in more sets than any other element. The sets containing b represent the solution.
N largest subsets: The n largest subsets can be easily generated from the intermediate result, you just have to check for duplicates. In your example the largest subset has size 3 and is the only one of its size. The next largest subset will have size 2. You obtain it via the elements occurring in two or more sets, i.e. a and b. So there are 3 ways you can pick 2 out of the 3 b-sets and one solution consisting of the two a-sets.
Upvotes: 1
Reputation: 881
Just count the number of occurances of each string. The maximum number of occurances is the largest intersection of subsets (assuming a strings are unique in each subset):
coll = {1:{'a','b'}, 2:{'b','e'}, 3:{'a','c'}, 4:{'b','f'}}
print(coll)
d=dict()
for subs in coll.values():
for s in subs:
d[s]=d.setdefault(s, 0)+1
m=max(d.values())
print(m)
Upvotes: 2