Tracy

Reputation: 1998

Finding the hundred largest numbers in a file of a billion

I went to an interview today and was asked this question:

Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?

I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?

Upvotes: 36

Views: 13658

Answers (14)

Adrian McCarthy

Reputation: 47962

There are lots of clever approaches (like the priority queue solutions), but one of the simplest things you can do can also be fast and efficient.

If you want the top k of n, consider:

allocate an array of k ints
while more input
  perform insertion sort of next value into the array

This may sound absurdly simplistic. You might expect this to be O(n^2), but it's actually only O(k*n), and if k is much smaller than n (as is postulated in the problem statement), it approaches O(n).
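
Purely as an illustration (my own sketch, not part of the answer), the loop above might look like this in Python. The function name is made up, and it assumes the input yields at least k values:

from itertools import islice

def top_k_by_insertion(values, k=100):
    """Keep the k largest values seen so far in a small ascending array."""
    it = iter(values)
    buf = sorted(islice(it, k))                # the first k values, ascending
    for v in it:
        if v <= buf[0]:                        # most inputs are rejected by this single comparison
            continue
        # linear insertion: shift smaller elements down, dropping the old minimum
        i = 1
        while i < len(buf) and buf[i] < v:
            buf[i - 1] = buf[i]
            i += 1
        buf[i - 1] = v
    return buf                                 # ascending; buf[-1] is the overall maximum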

You might argue that the constant factor is too high because doing an average of k/2 comparisons and moves per input is a lot. But most values will be trivially rejected on the first comparison against the kth largest value seen so far. If you have a billion inputs, only a small fraction are likely to be larger than the 100th so far.

(You could construe a worst-case input where each value is larger than its predecessor, thus requiring k comparisons and moves for every input. But that is essentially a sorted input, and the problem statement said the input is unsorted.)

Even the binary-search improvement (to find the insertion point) only cuts the comparisons to ceil(log_2(k)), and unless you special case an extra comparison against the kth-so-far, you're much less likely to get the trivial rejection of the vast majority of inputs. And it does nothing to reduce the number of moves you need. Given caching schemes and branch prediction, doing 7 non-consecutive comparisons and then 50 consecutive moves doesn't seem likely to be significantly faster than doing 50 consecutive comparisons and moves. It's why many system sorts abandon Quicksort in favor of insertion sort for small sizes.

Also consider that this requires almost no extra memory and that the algorithm is extremely cache friendly (which may or may not be true for a heap or priority queue), and it's trivial to write without errors.

Reading the file is probably the major bottleneck, so the real performance gains will come from the I/O side: by keeping the selection logic simple, you can focus your efforts on finding a good buffering strategy for minimizing the I/O.

If k can be arbitrarily large, approaching n, then it makes sense to consider a priority queue or other, smarter, data structure. Another option would be to split the input into multiple chunks, sort each of them in parallel, and then merge.

Upvotes: 0

FranMowinckel

Reputation: 4343

Here is another solution (about an eon later, I have no shame, sorry!) based on the second one provided by @paxdiablo. The basic idea is that you only need to keep further numbers if they're greater than the minimum you already have, and that sorting is not really necessary:

// your variables
n = 100
k = a number > n and << 1 billion
create array1[n], array2[k]

read first n numbers into array2
find minimum and maximum of array2 
while more numbers:
  if number > maximum:
    store in array1
    if array1 is full: // I don't need contents of array2 anymore
       array2 = array1
       array1 = []
  else if number > minimum:
    store in array2
    if array2 is full:
       x = n - array1.count()
       find the x largest numbers of array2 and discard the rest
       find minimum and maximum of array2
  else:
    discard the number
endwhile

// Finally
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
return merge array1 and array2 

The critical step is the function for finding the x largest numbers in array2; you can use the fact that you already know array2's minimum and maximum to speed that function up.

Actually, there are lots of possible optimisations, since you don't really need to sort it; you just need the x largest numbers.
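
For example (my illustration, not part of the answer), the "x largest without a full sort" step could simply lean on the standard library's heapq.nlargest, which does a partial selection rather than a full sort:

import heapq

def keep_x_largest(array2, x):
    """Keep only the x largest values of array2 without fully sorting it."""
    kept = heapq.nlargest(x, array2)           # partial selection, roughly O(len(array2) * log x)
    return kept, kept[-1], kept[0]             # values (descending), new minimum, new maximum

# e.g. keep_x_largest([7, 42, 3, 19, 8, 25], 3) -> ([42, 25, 19], 19, 42)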

Furthermore, if k is big enough and you have enough memory, you could even turn it into a recursive algorithm for finding the n largest numbers.

Finally, if the numbers are already sorted (in any order), the algorithm is O(n).

Obviously, this is just theoretical, because in practice you would use standard sorting algorithms and the bottleneck would probably be the I/O.

Upvotes: 0

Pushpendre

Reputation: 835

Here's some Python code which implements the algorithm suggested by Ferdinand Beyer above. Essentially it's a heap; the only difference is that deletion has been merged with the insertion operation.

import random


class myds:
    """Implement a min-heap to find the k greatest numbers out of all that are provided."""

    def __init__(self, k, getnext):
        """k is the number of integers to return; getnext is an iterable that yields the numbers."""
        assert k > 0
        self.k = k
        self.getnext = getnext
        self.heap = []

    def housekeeping_bubbleup(self, index):
        # Sift a newly appended value up towards the root to restore the min-heap property.
        if index == 0:
            return
        parent_index = (index - 1) // 2
        if self.heap[parent_index] > self.heap[index]:
            self.heap[index], self.heap[parent_index] = self.heap[parent_index], self.heap[index]
            self.housekeeping_bubbleup(parent_index)

    def insertonly_level2(self, n):
        self.heap.append(n)
        self.housekeeping_bubbleup(len(self.heap) - 1)

    def insertonly_level1(self, n):
        """Runs for the first k numbers only, so it can be as slow as I want."""
        # Simply fill the heap while it is not yet full; no value should be discarded here.
        self.insertonly_level2(n)

    def housekeeping_bubbledown(self, index, length):
        # Sift a replaced root down towards the leaves to restore the min-heap property.
        child_index_l = 2 * index + 1
        child_index_r = 2 * index + 2
        if child_index_l >= length:                       # no child
            return
        elif child_index_r >= length:                     # only a left child
            if self.heap[child_index_l] < self.heap[index]:
                child_index = child_index_l
            else:
                return
        else:                                             # both children: pick the smaller one
            if self.heap[child_index_r] < self.heap[child_index_l]:
                child_index = child_index_r
            else:
                child_index = child_index_l

        self.heap[index], self.heap[child_index] = self.heap[child_index], self.heap[index]
        self.housekeeping_bubbledown(child_index, length)

    def insertdelete_level1(self, n):
        # Replace the minimum (the root) with n, then sift it down: deletion merged with insertion.
        self.heap[0] = n
        self.housekeeping_bubbledown(0, len(self.heap))

    def insert_to_myds(self, n):
        if len(self.heap) < self.k:
            self.insertonly_level1(n)
        elif n > self.heap[0]:
            self.insertdelete_level1(n)

    def run(self):
        for n in self.getnext:
            self.insert_to_myds(n)
            print(self.heap)
        return self.heap


def createinput(n):
    input_arr = list(range(n))
    random.shuffle(input_arr)
    with open('input', 'w') as f:
        for value in input_arr:
            f.write(str(value))
            f.write('\n')


createinput(20)                       # create a small sample 'input' file so the demo below can run
with open('input') as f:
    input_arr = [int(x) for x in f]
myds_object = myds(4, iter(input_arr))
output = myds_object.run()
print(output)

Upvotes: 1

prasath raman

Reputation: 41

  1. Assuming that the 1 billion numbers fit into memory, the best sorting approach is heap sort: form a heap and fetch the first 100 numbers. Complexity: O(n log n + 100) (the 100 for fetching the first 100 numbers).

    Improving the solution:

    Divide the implementation into two heaps (so that insertions are less complex) and, while fetching the first 100 elements, apply a merge algorithm across the two.

Upvotes: 1

Ferdinand Beyer

Reputation: 67157

Obviously the interviewers want you to point out two key facts:

  • You cannot read the whole list of integers into memory, since it is too large. So you will have to read it one by one.
  • You need an efficient data structure to hold the 100 largest elements. This data structure must support the following operations:
    • Get-Size: Get the number of values in the container.
    • Find-Min: Get the smallest value.
    • Delete-Min: Remove the smallest value to replace it with a new, larger value.
    • Insert: Insert another element into the container.

By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.

For example, for Fibonacci heaps, the operations Get-Size, Find-Min and Insert all are O(1) and Delete-Min is O(log n) (with n <= 100 in this case).

In practice, you could use a priority queue from your favorite language's standard library (e.g. std::priority_queue from the <queue> header in C++), which is usually implemented using a heap.
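
For illustration only (my addition, not part of the answer), a minimal sketch of this approach using Python's standard heapq module as the min-heap; the function name and the file name are made up:

import heapq

def largest_k(numbers, k=100):
    """One pass over a stream, keeping the k largest values in a min-heap of size k."""
    heap = []                                  # heap[0] is always the smallest kept value (Find-Min)
    for n in numbers:
        if len(heap) < k:
            heapq.heappush(heap, n)            # Insert
        elif n > heap[0]:
            heapq.heapreplace(heap, n)         # Delete-Min + Insert in one O(log k) step
    return sorted(heap, reverse=True)

# usage (hypothetical file of one integer per line):
# top = largest_k((int(line) for line in open('numbers.txt')), k=100)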

Upvotes: 53

The Archetypal Paul

Reputation: 41759

I think someone should have mentioned a priority queue by now. You just need to keep the current top 100 numbers, know what the lowest is and be able to replace that with a higher number. That's what a priority queue does for you - some implementations may sort the list, but it's not required.

Upvotes: 1

paxdiablo

Reputation: 881563

Here's my initial algorithm:

create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
    get next number N.
    if N > array[0]:
        if N > array[99]:
            shift array[1..99] to array[0..98].
            set array[99] to N.
        else
            find, using binary search, first index i where N <= array[i].
            shift array[1..i-1] to array[0..i-2].
            set array[i-1] to N.
        endif
    endif
endwhile

This has the (very slight) advantage that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort, and that you very quickly identify and throw away those values that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than 50 (on average) for a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).

You may even get bonus points for suggesting the use of optimised shift operations like memcpy in C provided you can be sure the overlap isn't a problem.
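
As a sketch only (my illustration, not part of the original answer), the first algorithm might look like this in Python, with bisect_right standing in for the hand-rolled binary search and a slice assignment standing in for the memcpy-style shift; the function name is made up, and it assumes at least 100 numbers in the stream:

from bisect import bisect_right
from itertools import islice

def top_hundred(stream, k=100):
    """Keep a sorted (ascending) array of the k largest values seen so far."""
    it = iter(stream)
    arr = sorted(islice(it, k))                # read first k numbers and sort ascending
    for n in it:
        if n <= arr[0]:                        # smaller than everything we keep: quick rejection
            continue
        i = bisect_right(arr, n)               # binary search for the insertion point
        arr[:i - 1] = arr[1:i]                 # drop the old minimum, shift the smaller part down
        arr[i - 1] = n
    return arr                                 # ascending; arr[-1] is the overall maximum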


One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):

read first hundred numbers into array 1 and sort them descending.
while more numbers:
    read up to next hundred numbers into array 2 and sort them descending.
    merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
    if more numbers:
        read up to next hundred numbers into array 2 and sort them descending.
        merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
    else
        copy list 3 to list 1.
    endif
endwhile

I'm not sure, but that may end up being more efficient than the continual shuffling.

The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):

list3.clear()
while list3.size() < 100:
    while list1.peek() >= list2.peek():
        list3.add(list1.pop())
    endwhile
    while list2.peek() >= list1.peek():
        list3.add(list2.pop())
    endwhile
endwhile

Simply put, this pulls the top 100 values out of the combined list by virtue of the fact that they're already sorted in descending order. I haven't checked in detail whether that would be more efficient; I'm just offering it as a possibility.
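
Again purely as an illustration of the three-list idea (my own sketch, names made up), using heapq.merge for the descending merge and keeping only the first (largest) 100 values after each round:

import heapq
from itertools import islice

def top_hundred_chunked(stream, k=100):
    """Process the input in chunks of k, keeping only the k largest after every merge."""
    best = []                                  # descending list of the k largest seen so far
    it = iter(stream)
    while True:
        chunk = list(islice(it, k))            # read up to the next k numbers
        if not chunk:
            return best
        chunk.sort(reverse=True)
        # merge two descending lists and keep only the first (largest) k values
        best = list(islice(heapq.merge(best, chunk, reverse=True), k))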

I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.

As with most interviews, technical skill is one of the things they're looking at.

Upvotes: 17

Howard May

Reputation: 6649

I believe the quickest way to do this is by using a very large bit map to record which numbers are present. To cover every 32-bit integer, the bitmap would need to be 2^32 / 8 bytes, which is about 536 MB. Scan through the integers, simply setting the corresponding bit in the bit map. Then look for the highest 100 entries.
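
As a rough sketch of that idea (my addition, not the answer's code), assuming the values are unsigned 32-bit integers; signed values would need an offset of 2^31 before indexing:

def top_100_bitmap(numbers, k=100):
    """Set one bit per value seen, then scan the bitmap downward from the largest possible value."""
    bitmap = bytearray(2**32 // 8)             # about 512 MiB, covering every 32-bit value
    for n in numbers:                          # assumes 0 <= n < 2**32
        bitmap[n >> 3] |= 1 << (n & 7)
    result = []
    n = 2**32 - 1
    while n >= 0 and len(result) < k:          # walk down from the top until k set bits are found
        if bitmap[n >> 3] & (1 << (n & 7)):
            result.append(n)
        n -= 1
    return result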

NOTE: This finds the highest 100 distinct numbers, not the highest 100 entries counting duplicates (each value occupies a single bit no matter how often it appears), if you see the difference.

This kind of approach is discussed in the very good book Programming Pearls which your interviewer may have read!

Upvotes: 3

JoshD

Reputation: 12814

I'd traverse the list in order. As I go, I'd add elements to a set (or multiset, depending on duplicates). Once the set reaches 100 elements, I'd only insert if the value is greater than the min in the set (O(log m)), and then delete the min.

Calling the number of values in the list n and the number of values to find m:

this is O(n * log m)

Upvotes: 8

Tom Gullen

Reputation: 61727

You are going to have to check every number, there is no way around that.

Just as a slight improvement on the solutions already offered:

Given a list of 100 numbers:

9595
8505
...
234
1

You would check to see whether the newly found value is greater than the minimum value in your array, and if it is, insert it. However, searching from the bottom to the top can be quite expensive, so you may consider a divide-and-conquer approach: for example, compare against the 50th item in the array, which tells you whether the value belongs among the top 50 items or the bottom 50. Repeating this process gives a much faster search, since each comparison eliminates 50% of the search space.

Also consider the data type of the integers. If they are 32-bit integers and you are on a 64-bit system, you may be able to do some clever memory handling and bitwise operations to deal with two numbers from disk at once if they are contiguous in memory.

Upvotes: 1

Sergey Bankevich

Reputation: 136

If you find the 100th order statistic using a quicksort-style partition (quickselect), it will run in O(billion) on average. But I doubt that, with numbers on this scale and given the random access this approach needs, it will actually be faster than O(billion * log(100)).
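
For reference (my own sketch, not from the answer), a quickselect-style partition might look like this; it assumes the numbers have already been loaded into an in-memory Python list, and the function name is made up:

import random

def k_largest_by_partition(a, k):
    """Partition the list in place until its first k slots hold the k largest values (in no particular order)."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                          # Hoare-style partition, larger values to the left
            while a[i] > pivot:
                i += 1
            while a[j] < pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if k - 1 <= j:                         # the k-th boundary lies in the left part
            hi = j
        elif k - 1 >= i:                       # the k-th boundary lies in the right part
            lo = i
        else:                                  # the boundary falls among pivot-equal values: done
            break
    return a[:k]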

Upvotes: 0

ruslik

Reputation: 14880

Speed of the processing algorithm is absolutely irrelevant (unless it's completely dumb).

The bottleneck here is I/O (it's specified that they are on disk). So make sure that you work with large buffers.
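
To make the point concrete (my sketch, not the answer's), here is one way to combine large sequential reads with a simple selection. It assumes the file holds raw native-endian 32-bit integers, a whole number of them, which is purely an assumption about the format:

import heapq
from array import array

def top_100_buffered(path, k=100, buf_bytes=8 * 1024 * 1024):
    """Read the file in large fixed-size chunks so the selection never waits on tiny reads."""
    heap = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(buf_bytes)          # one large sequential read at a time
            if not chunk:
                break
            values = array('i')
            values.frombytes(chunk)            # reinterpret the buffer as 32-bit integers
            for n in values:
                if len(heap) < k:
                    heapq.heappush(heap, n)
                elif n > heap[0]:
                    heapq.heapreplace(heap, n)
    return sorted(heap, reverse=True)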

Upvotes: 7

Aliostad

Reputation: 81660

Keep a fixed array of 100 integers, initialised to Int.MinValue. As you read each of the 1 billion integers, compare it with the number in the first cell of the array (index 0). If it is larger, move up to the next cell, and keep moving up while it is larger, until you hit the end or a smaller value. Then store the value at that index and shift all the values in the previous cells down one cell... do this and you will find the 100 largest integers.

Upvotes: 3

Goz

Reputation: 62323

Create an array of 100 numbers all being -2^31.

Check whether the first number you read from disk is greater than the first in the list. If it is, copy the array down one index and store the new number there. If not, check against the next of the 100, and so on.

When you've finished reading all 1 billion numbers you should have the highest 100 in the array.

Job done.

Upvotes: 10
