Top N record sort to return only the numbers that fall in particular range in the sorted array

Question

I have a hashtable that may contain around 1-5 million records. I need to iterate over it to select some of the entries in it and then sort them in some order. Also, I don't actually the need all the sorted elements, but say some elements that fall in a particular range. For eg. if the hashtable contains 1 million records, I may require only top 1000th-2000th records. Is there a standard sort algorithm for this? If not may be I can implement a 2 pass algorithm, first pass to find the 1000th record and say next pass to determine what next next 1000 records could be. Is there any implementation of heap sort available for this? Any implementation of topN heap sort would help for this too I guess. There is also a constraint on the size of the array I can use. It would be the best, if I have the array size to be only the number of records requested.

Jim Mischel · Accepted Answer

There is a standard algorithm for this. As @Ivaylo Strandjev pointed out in his answer, quickselect is the typical way this is done. However, that will require O(n) extra space in this case because you can't rearrange the hash table.

The other way commonly used is a heap. If you want the 1000th to 2000th smallest items, you do it in two passes. First, you use the heap to find the 1000th smallest item. Then you do another pass to get the 1000 smallest items that are greater than or equal to the one you found in the first pass. Something like:

First pass:

h = new max heap
add first 1000 items to heap
for i = 1000 to n-1
    if hashtable[i] < heap.peek()
        heap.remove_first()
        heap.add(item)

The 1000th smallest item is now the largest item on the heap. So:

smallest = heap.remove_first()

Now make a new heap of size 1,000 and add that item. Then go through the list again and only add items that are >= smallest:

h = new max heap
for i = 0 to n-1
    if hashtable[i] >= smallest
        if heap.count < 1000
            heap.add(item)
        else if hashtable[i] < heap.peek()
            heap.remove_first()
            heap.add(item)

When you're done, the heap contains items 1000 to 2000.

One nice thing about this technique is that you can allocate a single array and manipulate it like a heap. See Binary heap for an overview, or check out my series on heaps. The first three articles explain the problem and provide enough information that you should be able to implement a simple heap in C. There is a code example in C# you can use as a starting point.

Remember, though, that I'm using a Max-heap in the example above. The discussions in my blog post is about Min-heap, so you'll have to be sure to change the comparisons accordingly.

Complexity of heap selection is O(n log k), where n is the size of the list and k is the number of items to be selected. If you want the items to be sorted when you return them in the array, you'll have to sort the heap (or remove the items one-by-one). In either case, that's O(k log k). So your total running time will be proportional to (n log k) + (k log k).

Although quickselect is asymptotically faster than heap select, in practice heap select is generally faster than quickselect when you're selecting 1% or fewer of the items. So if you're selecting 1,000 items from a list of a million, heap selection will almost certainly be faster. See When theory meets practice for details.

Update

As the OP pointed out in comments, this can require a rather large heap. For example if you want to find the 10,000th through 11,000th items, the first pass requires a heap of size 10,000. The second pass still requires only a heap of 1,000 items. Using the naive method, selecting the 990,000th through 999,000th items would require a heap of size 998,000. However, you could turn it around: use a Min-heap and then you'd only need a heap of size 1,000 again (i.e. find the 1,000th through 2,000th largest items). So the heap size you need for the first pass is min(r1, (n-r2)). The heap size required for the second pass is r2 - r1.

You could do it by creating a single heap of size r2. Do a pass to get the smallest r2 items, and then select the top r2-r1 items from that. It makes the complexity (n log r2) + (k log r2). And k is equal to r2-r1. Again, you could use Min-heap to reduce your memory requirement if r1 > n/2.

I considered the possibility of doing this in a single pass with a Min-max heap, but I don't think it'll work.

Top N record sort to return only the numbers that fall in particular range in the sorted array

Answers (2)

Update

Related Questions