efficient algorithm to intersect m ordered sets in memory?

Suppose we have m ordered sets and we want to find their intersection.

Which data structures should we use for the ordered sets and which algorithm would be the most efficient?

same question: Algorithm for N-way merge

It appears that the literature is huge. Thus a better question is this: Which good implementations are there?

Upvotes: 3

Views: 806

Answers (2)

This is only a sketch: please help me improve it.

This solution will be based on using binary search to limit the search to n/2^i number of elements for each set, and I will use efficient data structures to remember those comparisons for the next number.

The first thing to note is that the balanced binary tree is good at performing binary search, only when the interval of the search closely match that of a (sub)tree.

The other 2 structures that accept binary search is the array and the skip list. The array is inefficient for insertions and deletion, so the skip list seems the best choice.

We will need m arrays of size 64 that will contain the elements of each set per array that were compared in the binary search, inserted in the order of execution of the comparison.

We will also need a double linked list in which all the elements from all the sets that were used in the binary search will be inserted. Using a skip list here minimizes even more the number of comparisons needed.

The basic idea is this.

  1. We search the minimum element in each set with a binary search.
  2. In each binary search step, we add the new element in the array of the set and in the double linked list.
  3. Either there is a common minimum element or not.
  4. we remove the smallest element in the double linked list. New searches will start from the previous element in the binary search array of the set and will use half the distance than before. we use the previous binary search elements in the arrays to limit the new search to the smallest known interval.
  5. continue to 1.

Upvotes: 0

ony
ony

Reputation: 13223

You can create binary tree with link to parent node and implement classic algorithm of intersection/union:

  1. Set iterA to left-most (smallest) node of the tree (i.e. descend over left-most branches to leaf).
  2. Set iterB to first (smallest) node of the ordered set (if it is implemented with ordered array or to the left-most node if as tree).
  3. Branch by comparison of items pointed by iterA and iterB
    • If lower: yield item of union and advance iterA
    • If equals: yield item of union and item of intersection and advance both iterA and itemB
    • If greater: yield item of union and advance iterB
  4. Repeat until one of iterator is unable to advance
  5. Rest of items accessible from other iterator yield as union items

Advance of binary tree iterator by:

  • If current node have right child descend to it and descend to the left-most child of it while possible. Yield that item.
  • If node have parent ascend over it and repeat that while we ascending from right child. Yield that item.
  • Otherwise: all items of tree is yielded already (end of collection).

Update: If you know that your ordered set (walked by iterB) is much smaller than the tree you can use a bit more sophisticated algorithm for intersection:

  1. Initially set iterB to beginning of ordered set (lower value).
  2. Set iterA to the node which is minimum upper bound of value iterB.
  3. Branch by comparison of items pointed by iterA and iterB
    • If equals: yield item of intersection.
  4. Advance itemB to the next value.
  5. Advance iterA to the minimum upper bound of value at itemB starting from current itemA.
  6. Repeat until itemB pass all items of ordered set.

Where advance to the minimum upper bound from specific node is:

  • If value of current node less than target
    • Find upper bound on right child by walking right-children of each node
    • If even right-most node of that branch is lower than target: ascend to parent while moving from right child and restart from that node.
    • Else from node where we found upper bound
    • Find first left-most children value of which is less than target
      • If not found: left-most leaf of that branch is minimum upper bound
      • Else restart from that node (more precisely will be to use sub-algorithm which walk over left-most and right-most nodes narrowing borders).

The main idea of searching bound is narrowing upper and minimum bound ("-" is ignored nodes, "..." is new search range):

for B < X < A
    U
   / \-
  L
-/ \...

for A < X < B
  L
-/ \
    U
.../ \-

Upvotes: 1

Related Questions