java arrays algorithm sorting data-structures

Reputation: 625

Best way to retrieve K largest elements from large unsorted arrays?

I recently had a coding test during an interview. I was told:

There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?

During this, I was strongly hinted that I needed to sort the array.

So I suggested using the built-in sort(), or maybe a custom implementation if performance really mattered. I was then told that by using a Collection or array to store the K largest elements and a for-loop over the input, it is possible to achieve approximately O(N). In hindsight, I think it's O(N*K), because each iteration needs to scan the K-sized array to find the smallest element to replace, whereas sorting the whole array would make the code at least O(N log N).
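For reference, the for-loop idea described above might look roughly like the sketch below. The class and method names are made up for illustration, and it assumes K is small enough that a linear scan of the K-sized buffer is acceptable. The inner loop is what makes the cost O(N*K).

```java
import java.util.Arrays;

class LinearScanTopK {
    // Keeps the K largest values seen so far in a plain array (unordered).
    public static int[] topK(int[] arr, int k) {
        int size = Math.max(0, Math.min(k, arr.length));
        int[] best = Arrays.copyOf(arr, size);      // seed with the first `size` elements
        if (size == 0) return best;

        for (int i = size; i < arr.length; i++) {
            int minIdx = 0;
            for (int j = 1; j < size; j++) {         // O(K) scan for the current minimum
                if (best[j] < best[minIdx]) minIdx = j;
            }
            if (arr[i] > best[minIdx]) best[minIdx] = arr[i]; // replace the minimum
        }
        return best;
    }

    public static void main(String[] args) {
        int[] result = topK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3);
        Arrays.sort(result);                         // sort only for readable output
        System.out.println(Arrays.toString(result)); // [9, 12, 27]
    }
}
```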

I then reviewed this question on SO, "Write a program to find 100 largest numbers out of an array of 1 billion numbers", which suggests a priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N).

Is the for-loop method bad? How should I justify pros/cons of using the for-loop or the priorityqueue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other method of retrieval is called on the sorted array, it should be constant time. Is there some performance factor when running the actual code that I didn't consider when theorizing pseudocode?

Upvotes: 36

Answers (6)

Alexander Ivanchenko

Reputation: 29058

There is a large unsorted array of one million ints. The user wants to retrieve the K largest elements.

During this, I was strongly hinted that I needed to sort the array.

So, I suggested using a built-in sort() or maybe a custom implementation

That wasn't really a hint I guess, but rather a sort of trick to deceive you (to test how strong your knowledge is).

If you choose to approach the problem by sorting the whole source array using the built-in Dual-Pivot Quicksort, you can't obtain time complexity better than O(n log n).
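As a baseline for comparison, the full-sort approach could be sketched as follows. The class and method names are hypothetical, and the sketch assumes k is non-negative and that the caller's array should not be modified.

```java
import java.util.Arrays;

class SortTopK {
    // Returns the k largest values in ascending order; the input is left untouched.
    public static int[] topKBySorting(int[] arr, int k) {
        int[] copy = Arrays.copyOf(arr, arr.length);
        Arrays.sort(copy);                          // sorts the whole array: O(n log n)
        int from = Math.max(0, copy.length - k);    // guard against k > n
        return Arrays.copyOfRange(copy, from, copy.length);
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 3, 12, 7, 8, -5, 9, 27};
        System.out.println(Arrays.toString(topKBySorting(data, 3))); // [9, 12, 27]
    }
}
```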

Instead, we can maintain a PriorityQueue (a min-heap) that stores the result. While iterating over the source array, for each element we check whether the queue has reached size K. If not, the element is added to the queue. Otherwise (the size equals K), we compare the element against the lowest element in the queue: if it is smaller or equal, we ignore it; if it is greater, the lowest element is removed and the new element is added.

The time complexity of this approach would be O(n log k) because adding a new element into the PriorityQueue of size k costs O(log k) and in the worst-case scenario this operation can be performed n times (because we're iterating over the array of size n).

Note that the best case time complexity would be Ω(n), i.e. linear.

So the difference between sorting and using a PriorityQueue in terms of Big O boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n this approach would give a significant performance gain.

Here's an implementation:

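import java.util.Arrays;
import java.util.Collection;
import java.util.PriorityQueue;
import java.util.Queue;

public static int[] getHighestK(int[] arr, int k) {
    if (k <= 0) return new int[0]; // nothing to collect; also avoids peek() on an empty queue

    Queue<Integer> queue = new PriorityQueue<>(); // min-heap: peek() is the smallest kept value

    for (int next : arr) {
        // the queue is full and next beats the smallest kept value: evict that value
        if (queue.size() == k && queue.peek() < next) queue.remove();
        // there is room (not yet full, or a value was just evicted): keep next
        if (queue.size() < k) queue.add(next);
    }

    return toIntArray(queue); // heap order, not necessarily sorted
}

public static int[] toIntArray(Collection<Integer> source) {
    return source.stream().mapToInt(Integer::intValue).toArray();
}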

main()

public static void main(String[] args) {
    System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
}

Output:

[9, 12, 27]

Sorting in O(n)

We can achieve worst case time complexity of O(n) when there are some constraints regarding the contents of the given array. Let's say it contains only numbers in the range [-1000,1000] (sure, you haven't been told that, but it's always good to clarify the problem requirements during the interview).

In this case, we can use Counting sort which has linear time complexity. Or better, just build a histogram (first step of Counting Sort) and look at the highest-valued buckets until you've seen K counts. (i.e. don't actually expand back to a fully sorted array, just expand counts back into the top K sorted elements.) Creating a histogram is only efficient if the array of counts (possible input values) is smaller than the size of the input array.
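A minimal sketch of the histogram idea, assuming every value lies in the range [-1000, 1000] mentioned above. The class name, constants and method name are made up for illustration. The counts are only expanded back into the top K values, never into a fully sorted array.

```java
import java.util.Arrays;

class HistogramTopK {
    static final int MIN = -1000;   // assumed lower bound of the input values
    static final int MAX = 1000;    // assumed upper bound of the input values

    // Returns the k largest values in descending order.
    public static int[] topK(int[] arr, int k) {
        int[] counts = new int[MAX - MIN + 1];
        for (int v : arr) {
            if (v < MIN || v > MAX) {
                throw new IllegalArgumentException("value outside assumed range: " + v);
            }
            counts[v - MIN]++;                       // shift so that MIN maps to index 0
        }

        int size = Math.max(0, Math.min(k, arr.length));
        int[] result = new int[size];
        int filled = 0;
        // Walk the buckets from the highest value down until k values are collected.
        for (int i = counts.length - 1; i >= 0 && filled < size; i--) {
            for (int c = counts[i]; c > 0 && filled < size; c--) {
                result[filled++] = i + MIN;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 3, 12, 7, 8, -5, 9, 27};
        System.out.println(Arrays.toString(topK(data, 3))); // [27, 12, 9]
    }
}
```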

Another possibility is when the given array is partially sorted, consisting of several sorted chunks. In this case, we can use Timsort, which is good at finding sorted runs. When there are only a few runs, it deals with them in close to linear time.

Timsort is already implemented in Java, where it is used to sort objects (not primitives). So we can take advantage of the well-optimized and thoroughly tested implementation instead of writing our own, which is great. But since we are given an array of primitives, using the built-in Timsort has an additional cost: we first need to copy the contents of the array into a list (or array) of the wrapper type.

Upvotes: 12

AnoE

Reputation: 8355

There is an algorithm to do this in worst-case time complexity O(n*log(k)) with very benign constant factors (since there is just one pass through the original array, and the inner part that contributes the log(k) is executed relatively seldom if the input data is well-behaved).

Initialize a priority queue implemented with a binary heap A of maximum size k (internally using an array for storage). In the worst case, this has O(log(k)) for inserting, deleting and searching/manipulating the minimum element (in fact, retrieving the minimum is O(1)).
Iterate through the original unsorted array, and for each value v:
- If A is not yet full then
  - insert v into A,
- else, if v>min(A) then (*)
  - insert v into A,
  - remove the lowest value from A.

(*) Note that A can contain repeated values if some of the highest k values occur more than once in the source array. You can avoid that with a search operation that makes sure v is not yet in A. You'd also want a suitable secondary data structure for that search (since searching the priority queue itself is linear), e.g. a hash table or balanced binary search tree, both of which are available in java.util.

The java.util.PriorityQueue helpfully guarantees the time complexity of its operations:

this implementation provides O(log(n)) time for the enqueing and dequeing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).

Note that, as laid out above, we only ever remove the lowest (first) element from A, so we enjoy O(log(k)) for that. If you want to avoid duplicates as mentioned above, and you search the priority queue itself for each new value (O(k) per search), the worst-case overall cost becomes O(n*k) instead of O(n*log(k)). This happens, for example, with an input array that is already sorted in ascending order, where every single element v causes the inner part to fire.
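One way to get the duplicate-free variant without the O(k) search is to let a balanced binary search tree play both roles. The sketch below is an illustration, not the exact structure described above: it uses java.util.TreeSet, whose add, first and pollFirst operations are O(log k), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

class DistinctTopK {
    // Returns the k largest distinct values in ascending order.
    public static int[] topKDistinct(int[] arr, int k) {
        if (k <= 0) return new int[0];

        TreeSet<Integer> best = new TreeSet<>();
        for (int v : arr) {
            if (best.size() < k) {
                best.add(v);                         // add() silently ignores duplicates
            } else if (v > best.first() && best.add(v)) {
                best.pollFirst();                    // v was new, so evict the smallest
            }
        }
        return best.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 3, 12, 7, 8, -5, 9, 27, 27, 27};
        System.out.println(Arrays.toString(topKDistinct(data, 3))); // [9, 12, 27]
    }
}
```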

Upvotes: 3

GeertPt

Reputation: 17874

I think you misunderstood what you needed to sort.

You need to keep the K-sized list sorted, you don't need to sort the original N-sized input array. That way the time complexity would be O(N * log(K)) in the worst case (assuming you need to update the K-sized list almost every time).

The requirements said that N is very large but K is much smaller, so O(N * log(K)) is also smaller than O(N * log(N)).

You only need to update the K-sized list for each record that is larger than the K-th largest element before it. For a randomly distributed list with N much larger than K, that will be negligible, so the time complexity will be closer to O(N).
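To get a feel for how rarely the K-sized structure is touched, the following sketch counts the replacements for a pseudo-random input. The names, the seed and the sizes are arbitrary choices for illustration. For distinct values in random order, the expected number of replacements is roughly K*ln(N/K), which is tiny compared to N.

```java
import java.util.PriorityQueue;
import java.util.Random;

class HeapUpdateCount {
    // Returns how often the k-sized heap is modified after it has filled up.
    public static long countReplacements(int[] arr, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        long replacements = 0;
        for (int v : arr) {
            if (heap.size() < k) {
                heap.add(v);                         // still filling up
            } else if (k > 0 && v > heap.peek()) {
                heap.poll();                         // drop the current k-th largest
                heap.add(v);
                replacements++;
            }
        }
        return replacements;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000).toArray(); // arbitrary seed
        // Prints a count in the order of a thousand, not a million.
        System.out.println(countReplacements(data, 100));
    }
}
```

A sorted ascending input is the opposite extreme: every element after the first K triggers a replacement.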

For the K-sized list, you can take a look at the implementation in "Is there a PriorityQueue implementation with fixed capacity and custom comparator?", which uses a PriorityQueue with some additional logic around it.

Upvotes: 3

qwr

Reputation: 11024

This is a classic problem that can be solved with so-called heapselect, a simple variation on heapsort. It can also be solved with quickselect, but like quicksort, quickselect has a poor quadratic worst-case time complexity.

Simply keep a priority queue, implemented as a binary min-heap, that holds the k largest values seen so far. Walk through the array and insert each value into the heap (worst case O(log k)). When the priority queue grows beyond size k, delete the minimum value at the root (worst case O(log k)). After going through the n array elements, you have removed the n-k smallest elements, so the k largest elements remain. It's easy to see that the worst-case time complexity is O(n log k), which is faster than O(n log n) at the cost of only O(k) space for the heap.
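The insert-then-evict loop described above can be sketched with java.util.PriorityQueue, which is a binary min-heap by default. The class and method names are hypothetical. Draining the heap at the end costs an extra O(k log k) and yields the result in ascending order.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class HeapSelect {
    // Returns the k largest values in ascending order.
    public static int[] kLargest(int[] arr, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // root is the minimum
        for (int v : arr) {
            heap.add(v);                             // O(log k)
            if (heap.size() > k) heap.poll();        // too large: drop the minimum
        }

        int[] result = new int[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll();                 // smallest remaining value first
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 3, 12, 7, 8, -5, 9, 27};
        System.out.println(Arrays.toString(kLargest(data, 3))); // [9, 12, 27]
    }
}
```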

Upvotes: 6

Level_Up

Reputation: 824

Here is one idea. Create an int array of counters whose size covers the maximum int value (2147483647). Then, for every number in the original array, increment the counter at the index equal to that number.

At the end of this loop the counter array will look something like [1,0,2,0,3], which represents the numbers [0, 2, 2, 4, 4, 4] from the initial array.

To find the K biggest elements, loop backward over the counter array and count down from K to 0 each time you encounter a non-zero entry. If an entry is, for example, 2, you have to count that number twice.

The limitation of this approach is that it works only with integers, because the values are used as array indices.

Also, an int in Java ranges from -2147483648 to 2147483647, which means that only non-negative numbers can be placed in the counter array.

NOTE: if you know the maximum value in the input, you can shrink the counter array accordingly. For example, if the maximum is 1000, the array only needs 1001 entries (indices 0 through 1000), and the algorithm should then run very fast.

Upvotes: 4

Berthur

Reputation: 4495

Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:

1. Find the kth largest number x using Quickselect (O(n))
2. Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
3. Return your saved elements

(If x occurs more than once, you can avoid returning more than k elements by keeping count of how many copies of x you still need to add to the result.)

The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.
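A minimal sketch of this approach, assuming the array fits in memory and may be copied. The class and method names are hypothetical, and the partition scheme (Lomuto with a random pivot) is just one possible choice. After the selection step, the k largest elements sit in the right-side partition, so they are copied from there directly, which also takes care of duplicates of x. The average cost is O(n); the worst case is quadratic, for example when all elements are equal.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

class QuickselectTopK {
    // Returns the k largest values in no particular order.
    public static int[] topK(int[] arr, int k) {
        int n = arr.length;
        if (k <= 0) return new int[0];
        int[] a = Arrays.copyOf(arr, n);             // leave the caller's array untouched
        if (k >= n) return a;

        int target = n - k;                          // sorted position of the kth largest
        int lo = 0, hi = n - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == target) break;
            if (p < target) lo = p + 1; else hi = p - 1;
        }
        // Everything from target to the end is >= a[target].
        return Arrays.copyOfRange(a, target, n);
    }

    // Lomuto partition around a random pivot; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, ThreadLocalRandom.current().nextInt(lo, hi + 1), hi);
        int pivot = a[hi], store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        int[] result = topK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3);
        Arrays.sort(result);                         // sort only for readable output
        System.out.println(Arrays.toString(result)); // [9, 12, 27]
    }
}
```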

Upvotes: 26
