Reputation: 3645
Problem Statement: Find the 10 largest numbers in a file that contains billions of numbers
Input:
97911
98855
12345
78982
.....
.....
I actually came up with the solution below, which has:
- O(n) time when the file has numbers in descending order
- O(n*10) ~ O(n) time when the file has numbers in ascending order
- O(1) space complexity in all cases
I am reading the file using a file reader and keeping a sorted array that stores the maximum 10 numbers. I check whether the current line's number is greater than the smallest element in the array; if so, I insert it in the correct position by swapping.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class TopTen {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner sc = new Scanner(new FileReader(new File("demo.txt")));
        int[] maxNum = new int[10]; // sorted descending; maxNum[9] is the smallest of the top 10
        while (sc.hasNextLine()) {  // note: assumes non-negative inputs, since the array starts at 0
            int number = Integer.parseInt(sc.nextLine());
            if (number > maxNum[9]) {           // beats the current 10th largest
                maxNum[9] = number;             // replace the smallest of the top 10
                for (int i = 9; i > 0; i--) {   // bubble it up to its sorted position
                    if (maxNum[i] > maxNum[i - 1]) {
                        int temp = maxNum[i];
                        maxNum[i] = maxNum[i - 1];
                        maxNum[i - 1] = temp;
                    }
                }
            }
        }
        sc.close();
    }
}
I am looking for feedback on whether there are better ways to implement this.
Upvotes: 4
Views: 1031
Reputation: 22555
You can improve the algorithm with multithreading and parallelization: run, say, 20 threads, partition the file into 20 parts, and find the largest 10 numbers in each part. At the end, find the largest 10 numbers among those 20 arrays (each of length 10).
The point is that the operation is reading from the file or database, not writing, so it should be possible to access different parts of the file from different threads in parallel. Even if your input were in memory, this would be faster than a naive scan. It is still O(n), but with t threads operating in parallel, each thread performs about n/t comparisons, so it runs roughly t times faster than the single-threaded algorithm.
Finally, I should say that micro-optimizing the small array is pointless: the dominant cost is processing the big file, not maintaining a small array.
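Here is a minimal sketch of that partition-and-merge idea. It assumes the big file has already been split into part files beforehand (the names part-0.txt through part-3.txt and the four-way split are hypothetical); each worker keeps its local top 10 in a min-heap, and the main thread merges the per-part winners:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTopTen {
    static final int K = 10;

    // Find the K largest values in one part file using a min-heap of size K.
    static int[] topKOfFile(Path file) throws IOException {
        PriorityQueue<Integer> heap = new PriorityQueue<>(K);
        try (var lines = Files.lines(file)) {
            lines.mapToInt(Integer::parseInt).forEach(n -> {
                if (heap.size() < K) heap.add(n);
                else if (n > heap.peek()) { heap.poll(); heap.add(n); }
            });
        }
        return heap.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical part files produced by splitting the big input beforehand.
        List<Path> parts = List.of(Path.of("part-0.txt"), Path.of("part-1.txt"),
                                   Path.of("part-2.txt"), Path.of("part-3.txt"));
        ExecutorService pool = Executors.newFixedThreadPool(parts.size());
        List<Future<int[]>> results = new ArrayList<>();
        for (Path p : parts) {
            results.add(pool.submit(() -> topKOfFile(p))); // one worker per part
        }

        // Merge step: the overall top K must be among the per-part winners.
        PriorityQueue<Integer> merged = new PriorityQueue<>(K);
        for (Future<int[]> f : results) {
            for (int n : f.get()) {
                if (merged.size() < K) merged.add(n);
                else if (n > merged.peek()) { merged.poll(); merged.add(n); }
            }
        }
        pool.shutdown();
        merged.stream().sorted(Comparator.reverseOrder()).forEach(System.out::println);
    }
}
The merge step is cheap: with t parts it examines only t * 10 candidates, so essentially all of the work stays in the parallel scan.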
Upvotes: 2
Reputation: 40909
In general, to find the K largest numbers from N numbers:
1. Sort the numbers in O(N lg N) time and then take the K largest. If you have billions of numbers on disk, you will have to do external (on-disk) sorting, such as external Merge Sort.
2. Use a min-heap of capacity K and scan through the N values, keeping the K largest in the heap, with the smallest of them at the top. Running time: O(N lg K). The min-heap fits in memory while you stream the numbers from disk.
3. Use a selection algorithm to find the Kth largest value (equivalently, the (N-K)th smallest) in expected O(N) time. The Quickselect algorithm, which uses Quicksort's partition step, also partitions the values so that the K largest end up on one side of that element. Expected running time: O(N). However, this selection algorithm works in memory; see the sketch after this list.
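A minimal in-memory Quickselect sketch of option 3 (the class name, the random-pivot Lomuto partition, and the sample data are my own illustration, not from the answer): after the call, the K largest values occupy the last K slots of the array, in no particular order.
import java.util.concurrent.ThreadLocalRandom;

public class QuickSelect {
    // Partially sort a so that a[lo..k-1] <= a[k] <= a[k+1..hi].
    static void select(int[] a, int lo, int hi, int k) {
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k) return;
            if (p < k) lo = p + 1; else hi = p - 1; // recurse only into the side holding k
        }
    }

    // Lomuto partition around a random pivot; returns the pivot's final index.
    static int partition(int[] a, int lo, int hi) {
        int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, r, hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(a, i++, j);
        swap(a, i, hi);
        return i;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        int[] a = {5, 1, 9, 3, 7, 8, 2, 6, 4, 10, 0};
        int k = 3; // we want the 3 largest
        select(a, 0, a.length - 1, a.length - k); // index a.length-k separates them
        for (int i = a.length - k; i < a.length; i++) System.out.println(a[i]);
    }
}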
Upvotes: 1
Reputation: 43738
If the file is not sorted, you have to look at every number in the file at least once, because any of them could be among the 10 largest. Therefore O(n) is the best you can achieve.
Some optimization is possible (without changing the asymptotic complexity) by replacing the maxNum array with a min-heap. This runs faster once the count of numbers to be found is large enough (say you are looking for the 100 largest numbers); it will probably not yet pay off at 10.
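For illustration, here is a minimal sketch of that min-heap variant built on java.util.PriorityQueue, reading the same demo.txt as the question (the class name is hypothetical):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.PriorityQueue;

public class TopTenHeap {
    public static void main(String[] args) throws IOException {
        // Min-heap: the smallest of the current top 10 sits at the root.
        PriorityQueue<Integer> heap = new PriorityQueue<>(10);
        try (var lines = Files.lines(Path.of("demo.txt"))) {
            lines.mapToInt(Integer::parseInt).forEach(n -> {
                if (heap.size() < 10) {
                    heap.add(n);                 // heap not full yet; also handles negative inputs
                } else if (n > heap.peek()) {    // beats the current 10th largest
                    heap.poll();                 // drop the old minimum
                    heap.add(n);
                }
            });
        }
        System.out.println(heap); // the 10 largest numbers, in heap (not sorted) order
    }
}
Each replacement costs O(lg K) heap operations instead of up to K swaps in the sorted array, which is why the win only shows up for larger K.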
Upvotes: 4