rusted

Reputation: 131

maximal intersection between n sets

I have x sets, each containing y elements (unsorted integers). I want to find the maximal size of the intersection between any pair of these sets.

For example:

5 sets, size = 3

set 1 : 1 2 3

set 2 : 4 2 3

set 3 : 5 6 7

set 4 : 5 8 9

set 5 : 5 10 11

The maximal intersection is between set 1 and set 2, and its size is 2; so the answer is 2.

So, I can do it in O(x^2 * y) using HashSets, by checking all pairs and computing the size of each intersection. But I want to do it faster. I think there may be a specific algorithm or data structure that can help. Can you give me some ideas?
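For reference, the brute-force baseline described above can be sketched like this (Python sketch; the function name and variable names are mine):

```python
def max_pairwise_intersection(sets_list):
    """Brute force: check every pair of sets, O(x^2 * y) with hash sets."""
    sets_list = [set(s) for s in sets_list]
    best = 0
    for i in range(len(sets_list)):
        for j in range(i + 1, len(sets_list)):
            # len of the set intersection is the pairwise overlap size
            best = max(best, len(sets_list[i] & sets_list[j]))
    return best

sets_list = [[1, 2, 3], [4, 2, 3], [5, 6, 7], [5, 8, 9], [5, 10, 11]]
print(max_pairwise_intersection(sets_list))  # 2
```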

UPDATE: x and y are both about 10^3, the elements are ints, and no two sets are equal.

Upvotes: 13

Views: 1219

Answers (3)

Joel

Reputation: 2824

Here is some pseudocode:

function max_intersection(vector<vector<int>> sets):
    hashmap<int, vector<set_id>> val_map;
    foreach set_id:set in sets:
        foreach val in set:
            val_map[val].push_back(set_id);
    max_count = 0
    vector<int> counts = vector<int>(size = sets.size() * sets.size(), init_value = 0);
    foreach val:set_ids in val_map:
        foreach id_1:set_id_1 in set_ids:
            foreach id_2:set_id_2 in set_ids where id_2 > id_1:
                count = ++counts[set_id_1 * sets.size() + set_id_2];
                if (count > max_count):
                    max_count = count;
    return max_count;

So if X is the number of sets and Y is the number of elements in each set:

  1. Inserting into val_map is O(X*Y)
  2. Creating counts and initializing each element to zero is O(X^2)
  3. If there are no intersections (each value occurs exactly once), the last loop runs in time O(X*Y). However, at the other extreme, if there is a large number of intersections (all sets are equivalent), then the last loop runs in O(X^2*Y).

So depending on the amount of intersections, the time complexity is somewhere between O(X*Y + X^2) and O(X^2*Y).
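A runnable Python translation of the pseudocode above (a sketch; the names follow the pseudocode):

```python
from collections import defaultdict

def max_intersection(sets):
    """Invert the data: map each value to the sets containing it,
    then count co-occurrences per pair of sets."""
    val_map = defaultdict(list)          # value -> list of set ids
    for set_id, s in enumerate(sets):
        for val in s:
            val_map[val].append(set_id)
    n = len(sets)
    counts = [0] * (n * n)               # flattened pair counter
    max_count = 0
    for set_ids in val_map.values():
        for a in range(len(set_ids)):
            for b in range(a + 1, len(set_ids)):
                i, j = set_ids[a], set_ids[b]
                counts[i * n + j] += 1
                max_count = max(max_count, counts[i * n + j])
    return max_count

sets = [[1, 2, 3], [4, 2, 3], [5, 6, 7], [5, 8, 9], [5, 10, 11]]
print(max_intersection(sets))  # 2
```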

Upvotes: 4

Ivaylo Strandjev

Reputation: 70929

I cannot think of a solution that improves on O(x*x*y), but I can suggest a way to avoid hashing, turning the *expected* complexity O(x*x*y) into a *deterministic* O(x*x*y) at the cost of about 10^6 additional memory. Given the constraints you've provided, there are no more than 10^6 distinct numbers in total.

So my idea is the following: collect all the numbers, sort them, and remove duplicates. Assign each distinct number a unique index from 1 to 10^6 (or to however many distinct numbers there are), using its position in the sorted, deduplicated array. After that, instead of a hashmap for each pair, use a bitset of size 10^6.

This way you get a guaranteed complexity of O(x*x*y), as the precomputation I propose costs O(x * y * (log(x) + log(y))).
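The coordinate-compression idea above can be sketched as follows (Python sketch, names mine; Python's arbitrary-precision ints stand in for the fixed-size bitset you would use in C++, e.g. std::bitset):

```python
def max_intersection_bitset(sets):
    """Avoid hashing: compress values to dense indices, represent each
    set as a bitmask, and take popcounts of pairwise ANDs."""
    # Sort and deduplicate all values, assign each a dense index.
    all_vals = sorted({v for s in sets for v in s})
    index = {v: k for k, v in enumerate(all_vals)}
    # Build one bitmask per set: bit k is set iff the set contains
    # the value with dense index k.
    masks = []
    for s in sets:
        m = 0
        for v in s:
            m |= 1 << index[v]
        masks.append(m)
    best = 0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            # Popcount of the AND is the intersection size.
            best = max(best, bin(masks[i] & masks[j]).count("1"))
    return best

sets = [[1, 2, 3], [4, 2, 3], [5, 6, 7], [5, 8, 9], [5, 10, 11]]
print(max_intersection_bitset(sets))  # 2
```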

Upvotes: 2

kostek

Reputation: 801

One optimisation that I can think of is remembering the intersection sizes between the first set and the rest of them, and then using that data to prune some cases.

How you can use it:

If you have sets A, B, C of length n and

intersection(A,B) = p
intersection(A,C) = q

then

intersection(B,C) <= n - abs(p - q)

since B∩C contains at most min(p,q) elements inside A and at most n - max(p,q) elements outside A, which sums to n - abs(p - q).

For sets in your case:

S0 = { 1 2 3 }
S1 = { 4 2 3 }
S2 = { 5 6 7 }

you compute intersection(S0,S1) = 2 and remember the result:

[ i(0,1)=2 ]

then intersection(S0,S2) = 0, so

[ i(0,1)=2; i(0,2)=0 ]

And before computing intersection(S1,S2), the bound gives

intersection(S1,S2) <= 3 - abs(2 - 0) = 1

which cannot beat the best result you have so far (2), so this pair can be skipped entirely.

A further improvement would be to remember more exact intersection results while still not computing all of them.

I'm not sure if this is the best option. Maybe there exists a totally different approach to this problem.
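The pruning idea described above can be sketched as follows (Python sketch, names mine; set 0 plays the role of the reference set A, and all sets are assumed to have equal size n as in the question):

```python
def max_intersection_pruned(sets):
    """Use |B∩C| <= n - |p - q| (p = |A∩B|, q = |A∩C|) to skip
    pairs that cannot beat the current best."""
    sets = [set(s) for s in sets]
    n = len(sets[0])
    # Intersection size of every set with set 0 (p[0] = n, unused).
    p = [len(sets[0] & s) for s in sets]
    best = max(p[1:], default=0)
    for i in range(1, len(sets)):
        for j in range(i + 1, len(sets)):
            if n - abs(p[i] - p[j]) <= best:
                continue                 # bound says this pair can't win
            best = max(best, len(sets[i] & sets[j]))
    return best

sets = [[1, 2, 3], [4, 2, 3], [5, 6, 7], [5, 8, 9], [5, 10, 11]]
print(max_intersection_pruned(sets))  # 2
```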

Upvotes: 4
