memecs
memecs

Reputation: 7574

Efficient intersection operation between non-disjoint sets

Does anybody knows of specific data-structures/algorithm to handle the following problem efficiently:

Given a set A and a set of sets S = {X,Y,Z..} I want to compute the size of the intersection between A and all sets in S exploiting the fact that most of them are non-disjoint i.e. share numbers.

For example: given A = {1,2...10}, X = {1,3,4,5,7} and Y = {2,4,5,7,9,10}, it's more efficient to compute the intersection between A and X intersect Y, A and X - X intersect Y, A and Y - X intersect Y and the sum up the results.

A practical example could be finding the number of occurrences of a keyword in a large set of documents that shares piece of text, (not the total, but per-document.)

Note that the only difference to Map-Reduce is that documents share parts of texts, and those parts should be parsed only once.

If this can be of any help, the way I am reasoning about the problem right now is of a graph/tree in which nodes are overlapping regions whose O(n) traversal gives the size of intersection between A and all elements of S. The problem I am facing is how to find the optimal set of nodes to be used. But maybe there are already off-the-shelf solutions for it.

Upvotes: 2

Views: 206

Answers (1)

smossen
smossen

Reputation: 887

If you expect big overlaps then it might be worth storing the sets as treaps with unique representation of nodes. This should be faster than anything else if the overlaps are big enough.

See the following answer: https://cs.stackexchange.com/a/18006/10483

Upvotes: 0

Related Questions