Evgeny Lazin

Reputation: 9413

Binary search in many sorted arrays

I have many arrays with sorted data. I need to perform binary search in these arrays. If the key ranges in these arrays were disjoint, it would be possible to sort the arrays by range and then perform binary search as with a single array. But in my case, the key ranges can overlap. Then it is only possible to filter out some arrays and sort the rest. In my case most of the arrays don't overlap, so filtering will usually return only one array, but bad data can still ruin performance.

Is it possible to use a better algorithm in this case? It is possible to modify the arrays slightly, or to add some metadata or links to other arrays.

Update: These arrays are data pages backed by disk storage; I use memory-mapped files for this. I can sort data inside a page very quickly, because no copying is involved. But to merge two pages I need to copy a large amount of data between them. I have a very large amount of data, terabytes! Each page is only 8 MB, though, so it can be searched quickly. New pages are added to the storage from time to time. The pages contain time-series data, so they are already partially sorted, and new arrays don't overlap with old data most of the time.

Upvotes: 3

Views: 603

Answers (5)

sidquanto

Reputation: 315

Maintain multiple groups of arrays which have disjoint ranges.

When performing the binary search, do it in parallel over these groups or try it over groups based on smallest first.

For every group, maintain the ranges, and whenever a new page arrives, attach it to the largest group whose existing ranges are all disjoint from the new page's range. If the page does not fit any of the groups, create a new one.

Since, as you said, the ranges rarely overlap, the chance of having these extra groups is quite small, and yet the algorithm can adapt when such an anomaly occurs.
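The assignment step above can be sketched as follows. This is a minimal illustration, assuming each page is represented by its `(lo, hi)` key range and each group is just a list of such ranges (all names here are hypothetical, not from the question):

```python
def assign_page(groups, page_lo, page_hi):
    """Attach (page_lo, page_hi) to the largest group whose pages all stay
    disjoint from it; otherwise start a new group. Mutates `groups`."""
    candidates = []
    for group in groups:
        # The new page fits only if it overlaps no existing range in the group.
        if all(page_hi < lo or page_lo > hi for lo, hi in group):
            candidates.append(group)
    if candidates:
        target = max(candidates, key=len)  # prefer the largest group
    else:
        target = []                        # no group fits: open a new one
        groups.append(target)
    target.append((page_lo, page_hi))
    return groups
```

Each group can then be kept sorted by range and binary-searched as a single logical array.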

Upvotes: 1

Gene

Reputation: 46960

You imply that you have many queries on mostly static data, so I'll assume that. You're on the right track. Only don't exclude overlapping arrays. Track the overlaps. Here is how. Begin by compiling an index of ranges. If the arrays were disjoint, they would be the blocks. When you have two arrays overlapping:

|     A    |
     |       B       |

Split into three ranges:

| A  | AB  |   B     |

As the diagram implies, the index of ranges just records low and high bounds and a list of arrays that cover the range.

Now search the index (in memory) to determine which array or arrays to search. Then go search all of those. As a further optimization, you can use the block boundaries to limit the array search. In other words, if you land in block AB above, you can rule out part of A and part of B as you search them.
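A minimal sketch of such a range index, assuming each array is a sorted Python list (the function names are illustrative, not part of the answer):

```python
import bisect

def build_range_index(arrays):
    """Return (points, covers): consecutive pairs in `points` are blocks,
    and covers[i] lists the indices of arrays whose [min, max] range
    fully covers block i."""
    points = sorted({a[0] for a in arrays} | {a[-1] for a in arrays})
    covers = []
    for lo, hi in zip(points, points[1:]):
        covers.append([i for i, a in enumerate(arrays)
                       if a[0] <= lo and hi <= a[-1]])
    return points, covers

def arrays_to_search(points, covers, key):
    """Binary-search the block index for the arrays that may hold `key`.
    Note: a key exactly on a block boundary may also need the adjacent
    block checked; omitted here to keep the sketch short."""
    i = bisect.bisect_right(points, key) - 1
    if i < 0 or i >= len(covers):
        return []
    return covers[i]
```

For the A/B diagram above this yields exactly the three blocks | A | AB | B |, and a query falling into AB returns both arrays.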

How to compile and update the index efficiently? I suggest an interval tree. This page provides pseudocode. If you are programming in C++, you can use the relevant Boost library to good advantage.

With interval trees, each array is an interval. When you query the tree with a point, you get back all the relevant intervals. These are the arrays that must be searched.

Upvotes: 2

jboi

Reputation: 11892

Terabytes in 8 MB pages means you have to handle a few million pages. Each page is sorted internally, and values in different pages can (rarely, but they can) overlap each other.

I would expect that the cost of finding the right page is higher than that of finding the right entry within a page.

Therefore I recommend the following approach:

  • Maintain an array with lowest and highest key per page (lowestPageKey, highestPageKey).
  • Do a binary search to get the fitting pages and do a second binary search within the page.
  • For finding the fitting pages for searchKey, do a range-fitting binary search in the metadata.
    • Use the condition lowestPageKey <= searchKey <= highestPageKey to find the right page.
    • If lowestPageKey > searchKey, continue with the lower half of the array.
    • If highestPageKey < searchKey, continue with the higher half of the array.

This way you'll find the right page(s) and can issue a second binary search within the found pages.
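The two-level search described above might look like this. It is a sketch under the assumption that the page metadata is sorted by lowestPageKey and the pages are disjoint (the common case in the question); overlapping pages would need the neighboring entries checked as well:

```python
import bisect

def find_page(page_ranges, key):
    """page_ranges: list of (lowestPageKey, highestPageKey), sorted by
    lowestPageKey and assumed disjoint. Returns the index of the page
    that may contain `key`, or -1."""
    los = [lo for lo, _ in page_ranges]
    i = bisect.bisect_right(los, key) - 1
    if i >= 0 and page_ranges[i][0] <= key <= page_ranges[i][1]:
        return i
    return -1

def search(pages, page_ranges, key):
    """First binary search over page metadata, then a second binary
    search inside the located page. Returns (page_idx, entry_idx)."""
    i = find_page(page_ranges, key)
    if i < 0:
        return None
    j = bisect.bisect_left(pages[i], key)
    if j < len(pages[i]) and pages[i][j] == key:
        return (i, j)
    return None
```

The metadata array for a few million pages is small enough to keep entirely in memory, so only the second search touches the memory-mapped file.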

One more question from my side: if values in pages overlap, you can find more than one entry (or more than one page) that contains the search key. What do you expect back in such a case? One page/entry at random, all pages/entries, the first/last page/entry, or an error message?

Upvotes: 2

Bergi

Reputation: 664185

If the key ranges in these arrays were disjoint, it would be possible to sort the arrays by range and then perform binary search as with a single array. But in my case, the key ranges can overlap.

You can still sort them. Instead of naively filtering all arrays by their boundaries, you can use an interval tree to store them and retrieve the to-be-searched arrays in logarithmic time. Since you have a lot of arrays and they only seldom overlap, this should give a noticeable performance boost.
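For illustration, here is a minimal centered interval tree supporting the stab query this answer relies on: given a point, return every stored interval containing it. This is a from-scratch sketch (the class and method names are my own, not from any library):

```python
class IntervalTree:
    """Centered interval tree over (lo, hi, payload) tuples.
    stab(x) returns all intervals containing x in O(log n + k)."""
    def __init__(self, intervals):
        if not intervals:
            self.center = None
            return
        points = sorted(p for lo, hi, _ in intervals for p in (lo, hi))
        self.center = points[len(points) // 2]
        here, left, right = [], [], []
        for iv in intervals:
            lo, hi, _ = iv
            if hi < self.center:
                left.append(iv)       # entirely left of center
            elif lo > self.center:
                right.append(iv)      # entirely right of center
            else:
                here.append(iv)       # spans the center point
        self.by_lo = sorted(here, key=lambda iv: iv[0])
        self.by_hi = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalTree(left) if left else None
        self.right = IntervalTree(right) if right else None

    def stab(self, x):
        if self.center is None:
            return []
        out = []
        if x < self.center:
            for iv in self.by_lo:     # ascending lo: stop at first lo > x
                if iv[0] > x:
                    break
                out.append(iv)
            if self.left:
                out += self.left.stab(x)
        else:
            for iv in self.by_hi:     # descending hi: stop at first hi < x
                if iv[1] < x:
                    break
                out.append(iv)
            if self.right:
                out += self.right.stab(x)
        return out
```

Each array would be stored as (min_key, max_key, array_id); the stab query then yields exactly the arrays worth binary-searching.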

Upvotes: 4

Ivaylo Strandjev

Reputation: 70931

If you only plan on performing a few queries, I don't think you can improve your algorithm - I believe it is already quite good. If you expect to perform a lot of queries, I would advise you to merge the arrays into a single one and perform binary search over it. Merging is the same linear-time procedure used in merge sort. So as long as the number of queries makes up for the linear merge, it is worth it.
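The merge-then-search approach sketched in Python, using the standard library's k-way merge (function names here are illustrative):

```python
import bisect
import heapq

def merge_arrays(arrays):
    """Merge many sorted arrays into one; linear in the total size."""
    return list(heapq.merge(*arrays))

def contains(merged, key):
    """Answer each later query with one binary search."""
    i = bisect.bisect_left(merged, key)
    return i < len(merged) and merged[i] == key
```

Note that for the updated question (terabytes in 8 MB memory-mapped pages), the asker reports that merging pages requires heavy copying, so this trade-off only pays off if the query volume is very high.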

Upvotes: 2
