Reputation: 854

Fast set overlap matching algorithm

Let's say I have two sets:

A = [1, 3, 5, 7, 9, 11]

and

B = [1, 3, 9, 11, 12, 13, 14]

Both sets can have an arbitrary (and differing) number of elements.

I am writing a performance-critical application that needs to determine how many elements the two sets have in common. I don't actually need to return the matches, only the number of matches.

Obviously, a naive method would be brute force (comparing every element of one set against every element of the other), but I suspect that is nowhere near optimal. Is there an algorithm for performing this type of operation?

If it helps, in all cases the sets will consist of integers.

Upvotes: 1

Answers (3)

UmNyobe

Reputation: 22890

If both sets are sorted, the smallest element overall is either the minimum of the first set or the minimum of the second set. If it is the minimum of the first set, then the next smallest element is either the minimum of the second set or the second-smallest element of the first set. If you repeat this until the end of both sets, you visit the elements of both sets in sorted order. For your specific problem, you just need to check at each step whether the two current elements are equal.

You can iterate over the union of both sets with the following algorithm:

intersection_set_cardinality(s1, s2)
{
   iterator i = begin(s1);
   iterator j = begin(s2);

   count = 0;
   while(i != end(s1) && j != end(s2))
   { 
       if(elt(i) == elt(j))
       {
            count = count + 1;
            i = i + 1;
            j = j + 1;
       }
       else if(elt(i) < elt(j))
       {
           i = i + 1;
       }
       else
       {
           j = j + 1;           
       }
   }
   return count;
}
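
As a rough illustration, the pseudocode above could be written in Python along these lines. This is only a sketch: the function name `intersection_count` is made up here, and it assumes both inputs are lists sorted in ascending order with no repeated values.

```python
def intersection_count(s1, s2):
    """Count values common to two lists.

    Assumption: both lists are sorted ascending and contain no duplicates.
    """
    i = j = 0
    count = 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            # Same value in both lists: count it and advance both indices.
            count += 1
            i += 1
            j += 1
        elif s1[i] < s2[j]:
            # s1's current value is smaller, so it cannot match anything later in s2.
            i += 1
        else:
            j += 1
    return count


# The two sets from the question.
A = [1, 3, 5, 7, 9, 11]
B = [1, 3, 9, 11, 12, 13, 14]
print(intersection_count(A, B))  # prints 4
```

Each step advances at least one index, so the loop runs at most len(s1) + len(s2) times.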

Upvotes: 1

Drathier

Reputation: 14519

If both sets are sorted and roughly the same size, walking over them in sync, similar to the merge step of merge sort, is about as fast as it gets.

Look at the first elements.
If they match, you add that element to your result, and move both pointers forward.
Otherwise, you move the pointer that points to the smaller value forward.

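In Python:

def intersection(a, b):
    # a and b must be lists sorted in ascending order
    res = []
    ai = 0
    bi = 0
    while ai < len(a) and bi < len(b):
        if a[ai] == b[bi]:
            # match: record it and move both pointers forward
            res.append(a[ai])
            ai += 1
            bi += 1
        elif a[ai] < b[bi]:
            # a's current value is the smaller one
            ai += 1
        else:
            # b's current value is the smaller one
            bi += 1
    return res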

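If one set is significantly larger than the other, you can instead use binary search to look up each item of the smaller set in the larger one. With m items in the smaller set and n in the larger, that costs O(m log n) comparisons rather than O(m + n).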
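
A minimal sketch of the binary search variant, using `bisect_left` from Python's standard `bisect` module. The function name `count_common_bisect` is invented for this example, and it assumes `large` is sorted ascending and `small` contains no repeated values (`small` itself does not need to be sorted).

```python
from bisect import bisect_left


def count_common_bisect(small, large):
    """Count values of `small` that also occur in `large`.

    Assumption: `large` is sorted ascending; `small` has no duplicates.
    """
    count = 0
    for value in small:
        # Leftmost position where `value` could be inserted to keep `large` sorted.
        pos = bisect_left(large, value)
        if pos < len(large) and large[pos] == value:
            count += 1
    return count


print(count_common_bisect([3, 9, 10], [1, 3, 9, 11, 12, 13, 14]))  # prints 2
```

Where exactly this beats the in-sync walk depends on the data, so it is worth measuring both with realistic sizes.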

Upvotes: 2

FDavidov

Reputation: 3675

Here is the idea (a very high-level description, though).

By the way, I'll take the liberty of assuming that no number appears more than once in a set; for instance, [1,3,5,5,7,7,9,11] will not occur.

You define two variables that will hold the indices you are examining in each array.

You start with the first number of each set and compare them. Two possible conditions: they are equal or one is bigger than the other.

If they are equal, you count the event and move the pointers in both arrays to the next element.

If they differ, you move the pointer of the lower value to the next element in the array and repeat the process (compare both values).

The loop ends once you have moved past the last element of either array.
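
If the no-duplicates assumption above does not hold, the same walk can be adapted to count each distinct value once by skipping repeats after a match. This is just a sketch of one way to do it, not part of the description above; the name `count_common_distinct` is hypothetical, and both lists are still assumed to be sorted in ascending order.

```python
def count_common_distinct(a, b):
    """Count distinct values present in both lists.

    Assumption: both lists are sorted ascending; duplicates are allowed.
    """
    i = j = 0
    count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            value = a[i]
            count += 1
            # Skip every repeat of the matched value in both lists.
            while i < len(a) and a[i] == value:
                i += 1
            while j < len(b) and b[j] == value:
                j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


print(count_common_distinct([1, 3, 5, 5, 7, 7, 9, 11], [5, 5, 7, 12]))  # prints 2
```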

Hope I was able to explain it in a clear way.

Upvotes: 1
