John Chrysostom

Reputation: 3963

Find most similar sets efficiently (Python, data structures)

Suppose I have a couple thousand Python sets in a list called my_sets. For every set "A" in my_sets, I want to find the five sets (sets "B") in my_sets that contain the highest percentage of set A's members.

I'm currently storing the data as sets and looping over them twice to calculate overlap...

from random import randint
from heapq import heappush, heappop

my_sets = []

# Build 20 random sets of up to 20 elements drawn from range(0, 51)
for i in range(20):
    new_set = set()

    for j in range(20):
        new_set.add(randint(0, 50))

    my_sets.append(new_set)

for i in range(len(my_sets)):
    neighbor_heap = []

    for j in range(len(my_sets)):
        if i == j:
            continue

        # Smaller key = bigger overlap; note this raises ZeroDivisionError
        # if the intersection is empty (see the fix in Alain T.'s answer)
        heappush(neighbor_heap, (1 / len(my_sets[i] & my_sets[j]), j))

    results = []

    while len(results) < 5:
        results.append(heappop(neighbor_heap)[1])

    print('Closest neighbors to set {} are sets {}'.format(i, results))

However, this is obviously an O(N**2) algorithm, so it blows up when my_sets gets long. Is there a better data structure or algorithm that can be implemented in base Python for tackling this? There is no reason that my_sets has to be a list, or that each individual set actually has to be a Python set. Any way of storing whether or not each set contains members from a finite list of options would be fine (e.g., a bunch of bools or bits in a standardized order). And building a more exotic data structure to save time would also be fine.
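(For concreteness, a minimal sketch of that bit-based representation: Python ints work as arbitrary-length bitmasks. This assumes elements are small non-negative ints, as in the example code, and to_mask is just an illustrative helper.)

def to_mask(s):
    # Bit k of the mask is set iff element k is in the set
    mask = 0
    for element in s:
        mask |= 1 << element
    return mask

masks = [to_mask(s) for s in my_sets]

# The overlap of two sets is the popcount of the AND of their masks;
# int.bit_count() needs Python 3.10+, bin(x).count('1') works everywhere
overlap = bin(masks[0] & masks[1]).count('1')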

(As some people will likely want to point out, I could, of course, structure this as a Numpy array where rows are sets and columns are elements and cells are a 1/0, depending on whether that element is in that set. Then, you'd just do some Numpy operations. This would undoubtedly be faster, but I haven't really improved my algorithm at all, I've just offloaded complexity to somebody else's optimized C/Fortran/whatever.)
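(For reference, a rough sketch of that NumPy formulation, assuming elements fall in range(51) as in the example code; membership, overlap and top5 are illustrative names.)

import numpy as np

# Rows are sets, columns are elements; cell [i, e] is 1 iff element e is in set i
membership = np.zeros((len(my_sets), 51), dtype=np.int32)
for i, s in enumerate(my_sets):
    membership[i, list(s)] = 1

# overlap[i, j] = number of elements shared by sets i and j
overlap = membership @ membership.T
np.fill_diagonal(overlap, -1)               # exclude each set itself
top5 = np.argsort(-overlap, axis=1)[:, :5]  # indexes of the 5 largest overlaps per row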

EDIT: After a full test, the algorithm I originally posted runs in ~586 seconds under the agreed test conditions.

Upvotes: 3

Views: 947

Answers (4)

Alain T.

Reputation: 42143

You could reduce the number of passes through the set list by building, for each value, a list of the indexes of the sets that contain it. Then, in one additional pass through the set list, you can determine which sets have the most values in common by counting how often each index appears in the lists for a given set's values.

This will improve performance in some cases but, depending on the density of the data, it may not make a huge difference.

Here is an example using defaultdict and Counter from the collections module:

from collections import defaultdict, Counter

def top5Matches(setList):
    # Map each value to the indexes of the sets that contain it
    valueSets = defaultdict(list)
    for i, aSet in enumerate(setList):
        for v in aSet:
            valueSets[v].append(i)
    results = []
    for i, aSet in enumerate(setList):
        # Count, for each other set, how many of aSet's values it shares
        counts = Counter()
        for v in aSet:
            counts.update(valueSets[v])
        counts[i] = 0  # exclude the set itself
        top5 = [setList[j] for j, _ in counts.most_common(5)]
        results.append((aSet, top5))
    return results

In order to compare execution times I took the liberty of embedding your solution in a function. I also had to fix the case where two sets have no intersection at all:

from heapq import heappush, heappop

def OPSolution(my_sets):
    results = []
    for i in range(len(my_sets)):
        neighbor_heap = []
        for j in range(len(my_sets)):
            if i == j: continue
            # max(1, ...) avoids a ZeroDivisionError for disjoint sets
            heappush(neighbor_heap, (1 / max(1, len(my_sets[i] & my_sets[j])), j))
        top5 = []
        while len(top5) < 5:
            j = heappop(neighbor_heap)[1]
            top5.append(my_sets[j])
        results.append((my_sets[i], top5))
    return results

Both functions return a list of tuples containing the original set and a list of the top 5 sets based on the number of common values.

The two functions produce the same results, although the chosen top 5 may differ when a 6th (or later) set ties with the 5th on intersection count.

from random import randrange

my_sets = [ set(randrange(50) for _ in range(20)) for _ in range(20) ]
opResults = OPSolution(my_sets)
print("OPSolution: (matching counts)")
for i,(aSet,top5) in enumerate(opResults):
    print(i,"Top 5:",[len(aSet&otherSet) for otherSet in top5])
print("")

print("top5Matches: (matching counts)")
t5mResults = top5Matches(my_sets)
for i,(aSet,top5) in enumerate(t5mResults):
    print(i,"Top 5:",[len(aSet&otherSet) for otherSet in top5])
print("")

Output:

OPSolution: (matching counts)
0 Top 5: [8, 7, 7, 7, 6]
1 Top 5: [7, 6, 6, 6, 6]
2 Top 5: [8, 7, 6, 6, 6]
3 Top 5: [8, 7, 7, 6, 6]
4 Top 5: [9, 8, 8, 8, 8]
5 Top 5: [7, 6, 6, 6, 6]
6 Top 5: [8, 8, 8, 7, 6]
7 Top 5: [8, 8, 7, 7, 7]
8 Top 5: [9, 7, 7, 7, 6]
9 Top 5: [8, 8, 8, 7, 7]
10 Top 5: [8, 8, 7, 7, 7]
11 Top 5: [8, 8, 7, 7, 6]
12 Top 5: [8, 7, 7, 7, 7]
13 Top 5: [8, 8, 8, 6, 6]
14 Top 5: [9, 8, 8, 6, 6]
15 Top 5: [6, 6, 5, 5, 5]
16 Top 5: [9, 7, 7, 6, 6]
17 Top 5: [8, 7, 7, 7, 7]
18 Top 5: [8, 8, 7, 6, 6]
19 Top 5: [7, 6, 6, 6, 6]

top5Matches: (matching counts)
0 Top 5: [8, 7, 7, 7, 6]
1 Top 5: [7, 6, 6, 6, 6]
2 Top 5: [8, 7, 6, 6, 6]
3 Top 5: [8, 7, 7, 6, 6]
4 Top 5: [9, 8, 8, 8, 8]
5 Top 5: [7, 6, 6, 6, 6]
6 Top 5: [8, 8, 8, 7, 6]
7 Top 5: [8, 8, 7, 7, 7]
8 Top 5: [9, 7, 7, 7, 6]
9 Top 5: [8, 8, 8, 7, 7]
10 Top 5: [8, 8, 7, 7, 7]
11 Top 5: [8, 8, 7, 7, 6]
12 Top 5: [8, 7, 7, 7, 7]
13 Top 5: [8, 8, 8, 6, 6]
14 Top 5: [9, 8, 8, 6, 6]
15 Top 5: [6, 6, 5, 5, 5]
16 Top 5: [9, 7, 7, 6, 6]
17 Top 5: [8, 7, 7, 7, 7]
18 Top 5: [8, 8, 7, 6, 6]
19 Top 5: [7, 6, 6, 6, 6]

Comparing execution times for various combinations of settings shows that indexing by value performs better on larger data sets (albeit not by much in some cases).

[EDIT] Added Chris Hall's solution to measure the speed improvement gained by limiting the functionality to values in a consecutive range. I also had to embed it in a function and check that the results were the same. While doing so, I realized that we essentially had the same approach. The main difference is that Chris uses a list instead of a dictionary, which constrains the values to a range() whose size must be provided.

def chrisHall(my_sets, valueRange):
    results = []
    # Invert: for each value, the list of indexes of the sets containing it
    my_inv_sets = [[] for i in range(valueRange)]
    for i in range(len(my_sets)):
        for j in range(valueRange):
            if j in my_sets[i]:
                my_inv_sets[j].append(i)

    for i in range(len(my_sets)):
        counter = Counter()

        for j in my_sets[i]:
            counter.update(my_inv_sets[j])

        # most_common(6)[1:] skips the set itself, which always ranks first
        top5 = [my_sets[j] for j, _ in counter.most_common(6)[1:]]
        results.append((my_sets[i], top5))
    return results

Performance tests were also embedded in a function to avoid repeating the boilerplate code:

from random import randrange
from timeit import timeit

def compareSolutions(title,setCount,setSize,valueRange,count=1):

    print("-------------------")
    print(title,setCount,"sets of",setSize,"elements in range 0 ...",valueRange)
    testSets = [ set(randrange(valueRange) for _ in range(setSize)) for _ in range(setCount) ]

    t = timeit(lambda: chrisHall(testSets,valueRange),number=count)
    print("chrisHall",t)

    t = timeit(lambda: top5Matches(testSets),number=count)
    print("top5Matches",t)

    t = timeit(lambda: OPSolution(testSets),number=count)
    print("OPSolution",t)

compareSolutions("SIMPLE TEST SET",20,20,50,count=100)
compareSolutions("MORE SETS:",2000,20,50)
compareSolutions("FEWER INTERSECTIONS:",2000,20,500)
compareSolutions("LARGER SETS:",2000,200,500)
compareSolutions("SETTING FROM COMMENTS:",10000,200,1000)

Results:

-------------------
SIMPLE TEST SET 20 sets of 20 elements in range 0 ... 50
chrisHall 0.0766431910000005
top5Matches 0.07549873900000037
OPSolution 0.05089954700000021
-------------------
MORE SETS: 2000 sets of 20 elements in range 0 ... 50
chrisHall 1.274499733999999
top5Matches 1.2646208220000013
OPSolution 3.796912927000001
-------------------
FEWER INTERSECTIONS: 2000 sets of 20 elements in range 0 ... 500
chrisHall 0.4685694170000012
top5Matches 0.42844527900000173
OPSolution 3.5187148479999983
-------------------
LARGER SETS: 2000 sets of 200 elements in range 0 ... 500
chrisHall 8.538208329
top5Matches 8.51855685
OPSolution 23.192823251999997
-------------------
SETTING FROM COMMENTS: 10000 sets of 200 elements in range 0 ... 1000
chrisHall 190.55364428999997
top5Matches 176.066835327
OPSolution 829.934181724

Upvotes: 1

Chris Hall

Reputation: 1772

Could you:

  1. invert the sets, producing for each set element a list (or set) of the sets which contain it.

    This is O(n*m) -- for n sets with on average m elements per set.

  2. for each set S, consider its elements and (using 1) construct a list (or heap) of the other sets, counting how many elements each shares with S -- then pick the 'best' 5.

    This is O(n*m*a), where a is the average number of sets each element is a member of.

How far removed that is from O(n*n) obviously depends on m and a.

Edit: A naive implementation in Python runs in 103 seconds on my machine...

from random import randint
from collections import Counter
from time import perf_counter  # time.clock() was removed in Python 3.8

old_time = perf_counter()

my_sets = []

for i in range(10000):
    new_set = set()

    for j in range(200):
        new_set.add(randint(0, 999))

    my_sets.append(new_set)

# Invert: my_inv_sets[v] lists the indexes of the sets containing value v
my_inv_sets = [[] for i in range(1000)]

for i in range(len(my_sets)):
    for j in range(1000):
        if j in my_sets[i]:
            my_inv_sets[j].append(i)

for i in range(len(my_sets)):
    counter = Counter()

    for j in my_sets[i]:
        counter.update(my_inv_sets[j])

    # Skip the first entry: the set itself always has the highest count
    print(counter.most_common(6)[1:])

print(perf_counter() - old_time)

Upvotes: 2

Stefan Pochmann

Reputation: 28596

It's simpler and a bit faster (looks like 5-10% on the large case with limits 10000, 200, 1000) to use heapq.nlargest:

from heapq import nlargest

for i in range(len(my_sets)):
    # nlargest keeps a heap of only 5 candidates while scanning all j != i
    results = nlargest(5,
                       (j for j in range(len(my_sets)) if j != i),
                       key=lambda j: len(my_sets[i] & my_sets[j]))
    print('Closest neighbors to set {} are sets {}'.format(i, results))

This doesn't build a heap with N-1 elements, only one with 5: nlargest maintains a 5-element heap while scanning, so each of the N-1 candidates costs O(log 5) to process rather than O(log N).

Upvotes: 0

Shubham Sharma

Reputation: 71689

I've used set intersection to find the common elements, then sorted the sets by the number of elements they have in common. Here's code you might want to try:

from random import randint

my_sets = []
for i in range(20):
    new_set = set()

    for j in range(20):
        new_set.add(randint(0, 50))

    my_sets.append(new_set)


for i in range(len(my_sets)):
    temp = dict()
    for j in range(len(my_sets)):
        if i == j:
            continue

        # Store the intersection with each other set
        common = my_sets[i] & my_sets[j]
        temp[j] = common

    # Sort by intersection size, descending, and keep the top 5
    five = sorted(temp.items(), reverse=True, key=lambda s: len(s[1]))[:5]
    five_indexes = [t[0] for t in five]
    print('Closest neighbors to set {} are sets {}'.format(i, five_indexes))

Upvotes: 0
