Peter M
Peter M

Reputation: 844

How to find all pairs of sets (in a collection) that share at least M elements?

Given a large set of sets, all the same size (call it S={s1,...,sn}), I want to find all pairs (si,sj) that have an overlap of at least M.

So if M=2 and S consists of

I want to identify the pairs (s1,s2), (s2,s4), and (s3,s4).

The straightforward approach compares every pair and checks for the size of the intersection, but this is prohibitively slow given the number of sets and size of sets I am using (something like O(log(m) n2) where m is the size of the sets?).

I've searched around and haven't found a similar question (though this answer is probably relevant). Any help would be greatly appreciated!

Upvotes: 1

Views: 263

Answers (1)

Tomer W
Tomer W

Reputation: 3443

Go by value

pseudo code:

typedef V => type of values;

list<Set> S;
Map<V, list<int>> val2sets;
for(int setIdx=0; setIdx < S.length; ++setIdx){
    foreach(V v in S[setIdx]) {
        val2sets[v].push(setIdx);
    }
}

int[][] setIntersectCount;
foreach(V as v ,list<int> as l in val2sets) {
    for(int i=0; i < l.length; ++i){
        for(int j=i+1; j < l.length; ++j){
            setIntersectCount[i][j]++;
            if(setIntersectCount[i][j] == 2){
                printf("set {0} and {1} are a pair\n", i, j);
            }
        }
    }
}

as for Complexity:

let S = # of sets.
let M = # of members in a set. let V = # of unique values in the sets.

O(SM) generating the val2sets

as for reading all values and matching each 2..., it is maxed at O(V x S^2) but probability it will be closer to O(SM) because uniformity will say not all sets will have all elements in common.

P.S. hope i didn't have any mistake in my calculations.

Worth to note: I did not calculate the Naieve approach complexity either :)

Upvotes: 2

Related Questions