Python - comparing sets and returning the one with the most matching elements

Question

I have a set of Strings: {'Type A', 'Type B', 'Type C'} for instance, I'll call it x. The set can have up to 10 strings.

There is also a big list of sets, for instance [{'Type A', 'Type B', 'Type C'}, {'Type A', 'Type B', 'Type C'}, {'Type B', 'Type C, 'Type D'}, {'Type E', 'Type F', 'Type G'}] and so on.

My goal is to return all the sets in the big list that contain 60% or more of the same elements as x. So in this example, it would return the first 3 sets but not the 4th.

I know I could iterate over every set, compare elements, and then use the number of similarities to go about my business, but this is quite time intensive and my big list will probably have many many sets. Is there a better way to go about this? I thought about using frozenset() and hashing them, but I'm not sure what hashing function I would use, and how I would compare hashes.

Any help would be appreciated - many thanks!

Synthaze · Accepted Answer

l = [{'Type A', 'Type B', 'Type C'}, {'Type A', 'Type B', 'Type C'}, {'Type B', 'Type C', 'Type D'}, {'Type E', 'Type F', 'Type G'}]

x = {'Type A', 'Type B', 'Type C'}

for s in l:
    print (len(x.intersection(s)))

Output:

With a function and a list of tuples returned:

def more_than(l,n):
    return [ (s,round(len(x.intersection(s))/len(x),2)) for s in l if len(x.intersection(s))/len(x) > n]
 
print (more_than(l,0.6))

Output:

[({'Type B', 'Type A', 'Type C'}, 1.0), ({'Type B', 'Type A', 'Type C'}, 1.0), ({'Type B', 'Type C', 'Type D'}, 0.67)]

Here, just for convenience, I used round(len(x.intersection(s))/len(x),2) which translates to round(x,y). The round() will simply round your ratio to the number of decimal mentioned using the y variable.

Python - comparing sets and returning the one with the most matching elements

Answers (2)

Related Questions