Sean
Sean

Reputation: 117

Find n largest values in large unsorted list, working with only a page of values at a time

I'm struggling conceptually.

How can I write a program to find the highest 10,000 numbers out of a list approximately 2 billion in size? Assume that the computer only has enough capacity to work with roughly 10,000 numbers at a time (out of the 2 billion). Excluding any overhead of the program itself, it's assumed I'd have exactly enough room in main memory to handle 10,000 numbers at a time.

It was suggested that I use a heap to handle the information but I'm not seeing how to do this when I can't sort ALL the numbers at once.

Upvotes: 2

Views: 218

Answers (1)

Adrian Leonhard
Adrian Leonhard

Reputation: 7370

  1. Add the first 10,00 numbers to a result list. (Keep this list sorted for further steps)
  2. Iterate through the rest of the 2 billion numbers; for each one check if it is bigger than the lowest number in the result list, if yes, replace the lowest number with this one.

This way you only need to keep 10,000 numbers in memory at the same time.

EDIT 25/2/15:

Assuming n = result size, m = input size, the number of times a number will have to replaced in the results list, calculated here Number of assignments necessary to find the minimum value in an array? for the case n = 1, can be extended to this case:

double averageReplacementCount = 0;
for (int i = n; i < m; i++) {
    averageReplacementCount += 1.0 / (i + 1);
}

For n = 10,000 and m = 2,000,000,000 this is only ~12.206 (< 13 !).

This only applies if the numbers are uniformly distributed. If they are descending, no replacements will be needed, but if they are ascending (worst-case scenario!), (m-n) replacements will be required.

This makes the choice of data structure for the result list potentially unimportant, as long as the minimum value is cached and can be compared in constant time.

Upvotes: 3

Related Questions