Reputation: 844
Given a large set of sets, all the same size (call it S={s1,...,sn}), I want to find all pairs (si,sj) that have an overlap of at least M.
So if M=2 and S consists of
I want to identify the pairs (s1,s2), (s2,s4), and (s3,s4).
The straightforward approach compares every pair and checks for the size of the intersection, but this is prohibitively slow given the number of sets and size of sets I am using (something like O(log(m) n2) where m is the size of the sets?).
I've searched around and haven't found a similar question (though this answer is probably relevant). Any help would be greatly appreciated!
Upvotes: 1
Views: 263
Reputation: 3443
Go by value
pseudo code:
typedef V => type of values;
list<Set> S;
Map<V, list<int>> val2sets;
for(int setIdx=0; setIdx < S.length; ++setIdx){
foreach(V v in S[setIdx]) {
val2sets[v].push(setIdx);
}
}
int[][] setIntersectCount;
foreach(V as v ,list<int> as l in val2sets) {
for(int i=0; i < l.length; ++i){
for(int j=i+1; j < l.length; ++j){
setIntersectCount[i][j]++;
if(setIntersectCount[i][j] == 2){
printf("set {0} and {1} are a pair\n", i, j);
}
}
}
}
as for Complexity:
let S = # of sets.
let M = # of members in a set.
let V = # of unique values in the sets.
O(SM)
generating the val2sets
as for reading all values and matching each 2...,
it is maxed at O(V x S^2) but probability it will be closer to O(SM)
because uniformity will say not all sets will have all elements in common.
P.S. hope i didn't have any mistake in my calculations.
Worth to note: I did not calculate the Naieve approach complexity either :)
Upvotes: 2