Reputation: 594

Sorting algorithm vs. Simple iterations

I'm just getting started in algorithms and sorting, so bear with me...

Let's say I have an array of 50000 integers. I need to select the smallest 30000 of them.
I thought of two methods :
1. I iterate the entire array and find each smallest integer
2. I first sort the entire array , and then simply select the first 30000.

Can anyone tell me what's the difference, which method would be faster, and why? What if the array was smaller or bigger? Would the answer change?

Upvotes: 3

Answers (4)

Gene

Reputation: 47020

For nearly all practical purposes, sorting and taking the first 30,000 is the likely to be best. In most languages, this is one or two lines of code. Hard to get wrong.

If you have a truly demanding application or are just out to fiddle, you can use a selection algorithm to find the 30,000th largest number. Then one more pass through the array will find 29,999 that are no bigger.

There are several well known selection algorithms that require only O(n) comparisons and some that are sub-linear for data with specific properties.

The fastest in practice is QuickSelect, which - as its name implies - works roughly like a partial QuickSort. Unfortunately, if the data happens to be very badly ordered, QuickSelect can require O(n^2) time (just as QuickSort can). There are various tricks for selecting pivots that the make it virtually impossible to get the worst case run time.

QuickSelect will finish with the array reordered so the smallest 30,000 elements are in the first part (unsorted) followed by the rest.

Because standard selection algorithms are comparison-based, they'll work on any kind of comparable data, not just integers.

Upvotes: 2

jfs

Reputation: 414825

30k is close to 50k. Just sort the array and get the smallest 30k e.g., in Python: sorted(a)[:30000]. It is O(n * log n) operation.

If you were needed to find 100 smallest items instead (100 << 50k) then a heap might be more suitable e.g., in Python: heapq.nsmallest(100, a). It is O(n * log k).

If the range of integers is limited—you could consider O(n) sorting methods such as counting sort and radix sort.

Simple iterative method is O(n**2) (quadratic) here. Even for a moderate n that is around a million; it leads to ~10**12 operations that is much worse than ~10**6 for a linear algorithm.

Upvotes: 2

goodpoutine

Reputation: 43

You can do this in potentially O(N) time with radix sort or counting sort, given that your input is integers.

Another method is to get the 30000th largest integer by quickselect and simply iterate through the original array. This has Θ(N) time complexity, but in the worst case has O(N^2) for quickselect.

Upvotes: 0

paddy

Reputation: 63481

Option 1 sounds like the naive solution. It would involve passing through the array to find the smallest item 30000 times. Each time it finds the smallest, presumably it would swap that item to the beginning or end of the array. In basic terms, this is O(n^2) complexity.

The actual number of operations involved would be less than n^2 because n reduces every time. So you would have roughly 50000 + 49999 + 49998 + ... + 20001, which amounts to just over 1 billion (1000 million) iterations.

Option 2 would employ an algorithm like quicksort or similar, which is commonly O(n.logn).

Here it's harder to provide actual figures, because some efficient sorting algorithms can have a worst-case of O(n^2). But let's say you use a well-behaved one that is guaranteed to be O(n.logn). This would amount to 50000 * 15.61 which is about 780 thousand.

So it's clear that Option 2 wins in this case.

What if the array was smaller or bigger? Would the answer change?

Unless the array became trivially small, the answer would still be Option 2. And the larger your array becomes, the more beneficial Option 2 becomes. This is the nature of time complexity. O(n^2) grows much faster than O(n.logn).

A better question to ask is "what if I want fewer smallest values, and when does Option 1 become preferable?". Although the answer is slightly more complex because of numerous factors (such as what constitutes "one operation" in Option 1 vs Option 2, plus other issues like memory access patterns etc), you can get the simple answer directly from time complexity. Option 1 would become preferable when the number of smallest values to select drops below n.logn. In the case of a 50000-element array, that would mean if you want to select 15 or less smallest elements, then Option 1 wins.

Now, consider an Option 3, where you transform the array into a min-heap. Building a heap is O(n), and removing one item from it is O(logn). You are going to remove 30000 items. So you have the cost of building plus the cost of removal: 50000 + 30000 * 15.6 = approximately 520 thousand. And this is ignoring the fact that n gets smaller every time you remove an element. It's still O(n.logn), like Option 2 but it is probably faster: you've saved time by not bothering to sort the elements you don't care about.

I should mention that in all three cases, the result would be the smallest 30000 values in sorted order. There may be other solutions that would give you these values in no particular order.

Upvotes: 2

Sorting algorithm vs. Simple iterations

Answers (4)

Related Questions