AndW

Reputation: 846

Python - comparing sets and returning the one with the most matching elements

I have a set of strings, for instance {'Type A', 'Type B', 'Type C'}, which I'll call x. The set can have up to 10 strings.

There is also a big list of sets, for instance [{'Type A', 'Type B', 'Type C'}, {'Type A', 'Type B', 'Type C'}, {'Type B', 'Type C', 'Type D'}, {'Type E', 'Type F', 'Type G'}] and so on.

My goal is to return all the sets in the big list that contain 60% or more of the same elements as x. So in this example, it would return the first 3 sets but not the 4th.
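To make the 60% criterion concrete, here is a minimal sketch of the check on the example data. It assumes the percentage is measured against the size of x (not against the candidate set) and that x is non-empty; the names overlap_ratio and big_list are only illustrative.

```python
def overlap_ratio(x, s):
    """Fraction of the elements of x that also appear in s (illustrative helper)."""
    # Assumes x is non-empty; an empty x would raise ZeroDivisionError.
    return len(x & s) / len(x)

x = {'Type A', 'Type B', 'Type C'}
big_list = [
    {'Type A', 'Type B', 'Type C'},
    {'Type A', 'Type B', 'Type C'},
    {'Type B', 'Type C', 'Type D'},
    {'Type E', 'Type F', 'Type G'},
]

# Share of x found in each candidate set: 3/3, 3/3, 2/3 and 0/3.
ratios = [overlap_ratio(x, s) for s in big_list]

# Keep the sets that contain at least 60% of x's elements.
matches = [s for s in big_list if overlap_ratio(x, s) >= 0.6]
```

The third set shares 2 of the 3 elements (about 67%), so it passes; the fourth shares none.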

I know I could iterate over every set, count the elements it shares with x, and filter on that count, but this is time-intensive and my big list will probably contain a great many sets. Is there a better way to do this? I thought about using frozenset() and hashing the sets, but I'm not sure which hash function I would use or how I would compare the hashes.

Any help would be appreciated - many thanks!

Upvotes: 0

Views: 906

Answers (2)

Synthaze

Reputation: 6090

l = [{'Type A', 'Type B', 'Type C'}, {'Type A', 'Type B', 'Type C'}, {'Type B', 'Type C', 'Type D'}, {'Type E', 'Type F', 'Type G'}]

x = {'Type A', 'Type B', 'Type C'}

# Print how many elements each set shares with x.
for s in l:
    print(len(x.intersection(s)))

Output:

3
3
2
0

With a function and a list of tuples returned:

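def more_than(l, n):
    # Pair each qualifying set with its ratio of shared elements, rounded to 2 decimals.
    # Uses >= so that a ratio of exactly n (60% here) also counts, as the question asks.
    return [(s, round(len(x.intersection(s)) / len(x), 2)) for s in l if len(x.intersection(s)) / len(x) >= n]

print(more_than(l, 0.6))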

Output:

[({'Type B', 'Type A', 'Type C'}, 1.0), ({'Type B', 'Type A', 'Type C'}, 1.0), ({'Type B', 'Type C', 'Type D'}, 0.67)]

Here, just for convenience, I used round(len(x.intersection(s))/len(x), 2), which has the form round(number, ndigits). round() simply rounds the ratio to the number of decimal places given by the second argument (2 here).

Upvotes: 4

user7864386

Reputation:

How about this?

x = {'Type A', 'Type B', 'Type C'}
lst = [{'Type A', 'Type B', 'Type C'}, 
       {'Type A', 'Type B', 'Type C'}, 
       {'Type B', 'Type C', 'Type D'},
       {'Type E', 'Type F', 'Type G'}]    
[s for s in lst if len(s.intersection(x)) >= len(x) * 0.6]
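
On the performance worry in the question: Python sets are already hash-based, so each intersection with a 10-element x is cheap, and the list comprehension above may well be fast enough. If the same big list is searched many times, one possible alternative (not taken from either answer, just a sketch) is an inverted index that maps each element to the positions of the sets containing it, so that only sets sharing at least one element with x are ever counted. The names build_index and find_matches are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(sets):
    """Map each element to the positions of the sets that contain it."""
    index = defaultdict(list)
    for pos, s in enumerate(sets):
        for element in s:
            index[element].append(pos)
    return index

def find_matches(x, sets, index, threshold=0.6):
    """Return the sets that contain at least `threshold` of x's elements."""
    counts = Counter()
    for element in x:
        # .get avoids creating empty entries for elements that appear in no set.
        for pos in index.get(element, ()):
            counts[pos] += 1
    needed = threshold * len(x)
    # sorted() keeps the results in their original list order.
    return [sets[pos] for pos in sorted(counts) if counts[pos] >= needed]

x = {'Type A', 'Type B', 'Type C'}
lst = [{'Type A', 'Type B', 'Type C'},
       {'Type A', 'Type B', 'Type C'},
       {'Type B', 'Type C', 'Type D'},
       {'Type E', 'Type F', 'Type G'}]

index = build_index(lst)          # built once, reused for every query
result = find_matches(x, lst, index)
```

Building the index costs one pass over every element of every set, so this only pays off when there are many queries or when most sets share nothing with x; it is worth timing against the plain comprehension on real data.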

Upvotes: 1
