Reputation: 9413
I have many arrays with sorted data. I need to perform binary search in this arrays. If key ranges in this arrays were disjoint it will be possible to sort arrays by range and then perform binary search as with single array. But in my case, key-ranges in this arrays can overlap. In this case it is only possible to perform filtering to exclude some arrays and then sort the other part. In my case most of the arrays doesn't overlap so filtering, most of the time, will return only one array but it is still possible for bad data to ruin performance.
Is it possible to use better algorithm in this case? It is possible to modify arrays slightly, add some metadata or links to other arrays.
Update This arrays is data pages backed by disk storage. I use memory mapped files for this. I can sort data inside page very fast, because copying doesn't involved in this process. But to merge two pages I need to copy large amount of data between pages. I have very large amount of data, terabytes! But each page is only 8Mb so it can be searched quickly. New pages added to the storage from time to time. Pages contains time-series data so it already partially sorted and new arrays doesn't overlap with old data most of the time.
Upvotes: 3
Views: 603
Reputation: 315
Maintain multiple groups of arrays which have disjoint ranges.
When performing the binary search, do it in parallel over these groups or try it over groups based on smallest first.
For every group, maintain the ranges and whenever a new page arrives, attach it to the largest group which does not have a disjoint range with this new page. If the page does not belong to any of the groups, create a new one.
As you said that most of the times the ranges do not overlap, the chances of having these extra groups is quite less and yet the algorithm can adapt when such an anomaly occurs.
Upvotes: 1
Reputation: 46960
You imply that you have many queries on mostly static data, so I'll assume that. You're on the right track. Only don't exclude overlapping arrays. Track the overlaps. Here is how. Begin by compiling an index of ranges. If the arrays were disjoint, they would be the blocks. When you have two arrays overlapping:
| A |
| B |
Split into three ranges:
| A | AB | B |
As the diagram implies, the index of ranges just records low and high bounds and a list of arrays that cover the range.
Now search the index (in memory) to determine which array or arrays to search. Then go search all those. As a further optimization, can use the block boundaries to limit the array search. In other words, if you get block AB above, you can rule out part of A and part of B as you search them.
How to compile and update the index efficiently? I suggest an interval tree. This page provides pseudocode. If you are programming in C++, you can use the relevant Boost library to good advantage.
With interval trees, each array is an interval. When you query the tree with a point, you get back all the relevant intervals. These are the arrays that much be searched.
Upvotes: 2
Reputation: 11892
Terabytes in 8MB pages means you have the handle a few million pages. Each page is sorted inside, and values in the pages can (rarely, but they can) overlap each other.
I would expect that the impact on finding the right page is higher then find the right entry within a page.
Therefore I recommend the following approach:
lowestPageKey
, highestPageKey
).searchKey
do a range fitting binary search in the meta data.
lowestPageKey <= searchKey <= highestPageKey
to find the right page.lowestPageKey > searchKey
you can continue with the lower half of the arrayhighestPageKey < searchKey
you can continue with the higher half of the arrayThis way you'll find the right page(s) and can issue a second binary search within the found pages.
One more question from my side: If values in pages overlap, you can find more then one entry (or more pages) that contain the search key. What do you expect back in such a case? One page/entry randomly, all pages/entries, the first/last page/entry or an error message?
Upvotes: 2
Reputation: 664185
If key ranges in this arrays were disjoint it will be possible to sort arrays by range and then perform binary search as with single array. But in my case, key-ranges in this arrays can overlap.
You still can sort them then. Instead of naively filtering all arrays by their boundaries, you can use a interval tree to store them and to retrieve the to-be-searched arrays in logarithmical time. Since you have a lot of arrays and they only seldom overlap each other, this should give a noticeable performance boost.
Upvotes: 4
Reputation: 70931
If you only plan on performing a few queries I don't think you can improve your algorithm - I believe it is already quite good. If you expect to perform a lot of queries I would advice you to merge the arrays to a single one and perform binary search over it. Merge is just the same algorithm that is part of merge sort and is linear. So as long as the number of queries makes up for the linear merge it is worth it.
Upvotes: 2