Efficient intersection operation between non-disjoint sets

Question

Does anybody knows of specific data-structures/algorithm to handle the following problem efficiently:

Given a set A and a set of sets S = {X,Y,Z..} I want to compute the size of the intersection between A and all sets in S exploiting the fact that most of them are non-disjoint i.e. share numbers.

For example: given A = {1,2...10}, X = {1,3,4,5,7} and Y = {2,4,5,7,9,10}, it's more efficient to compute the intersection between A and X intersect Y, A and X - X intersect Y, A and Y - X intersect Y and the sum up the results.

A practical example could be finding the number of occurrences of a keyword in a large set of documents that shares piece of text, (not the total, but per-document.)

Note that the only difference to Map-Reduce is that documents share parts of texts, and those parts should be parsed only once.

If this can be of any help, the way I am reasoning about the problem right now is of a graph/tree in which nodes are overlapping regions whose O(n) traversal gives the size of intersection between A and all elements of S. The problem I am facing is how to find the optimal set of nodes to be used. But maybe there are already off-the-shelf solutions for it.

Efficient intersection operation between non-disjoint sets

Answers (1)

Related Questions