MaiaVictor

Reputation: 52987

Best method for sorting when you can use numeric indices?

Most of the time, we use built-in libraries for sorting, which are generic. But most of the time, too, we are sorting based on numeric indices or other values that can be translated into indices. If I'm not mistaken, sorting numbers can be done in O(n). So why don't we ever use numeric sorting algorithms?

Upvotes: 1

Views: 300

Answers (2)

Bernhard Barker

Reputation: 55609

Is there really a need?

I'm not really sure that (single) integers (or floating-point values, for that matter, though most numeric sorts require or are only efficient for integers) are what is being sorted 'most of the time', so an algorithm that only works on integers doesn't seem particularly useful. I say 'single' integers as opposed to (strings or) objects that contain multiple integers, numbers, strings or whatever else.

Not to mention that (I believe) in most real-world programs whose primary purpose is more than just sorting data, the bottleneck shouldn't be sorting 'single' numbers with an O(n log n) sort. You're probably way better off changing the way your data is represented to remove the need for the sort rather than cutting down on the log n factor.

Numeric sorts

It's a common misconception, but no sorting algorithm (numeric or otherwise) is actually worst-case O(n); there's always some additional parameter that comes into play. For radix sort, the length of the numbers is the determining factor. For long numbers in short arrays, this length can easily be more than log n, resulting in worse performance than an O(n log n) sort (see the test below).
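
To make the "length of the numbers" point concrete, here is a minimal LSD radix sort sketch (base 10, non-negative ints only; the class and method names are mine, not from any library). It does one counting pass per decimal digit, so the digit count k is exactly the factor that can exceed log n for long numbers in short arrays.

    import java.util.Arrays;

    public class RadixSortSketch {
        // LSD radix sort on non-negative ints, base 10: one counting pass per digit.
        // The digit count k is what multiplies n in the O(k*n) bound.
        static void radixSort(int[] a) {
            int max = 0;
            for (int v : a) max = Math.max(max, v);
            int[] out = new int[a.length];
            for (long exp = 1; max / exp > 0; exp *= 10) {              // one pass per decimal digit
                int[] count = new int[10];
                for (int v : a) count[(int) (v / exp) % 10]++;
                for (int d = 1; d < 10; d++) count[d] += count[d - 1];  // prefix sums
                for (int i = a.length - 1; i >= 0; i--) {               // stable placement
                    int digit = (int) (a[i] / exp) % 10;
                    out[--count[digit]] = a[i];
                }
                System.arraycopy(out, 0, a, 0, a.length);
            }
        }

        public static void main(String[] args) {
            int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
            radixSort(data);
            System.out.println(Arrays.toString(data));   // [2, 24, 45, 66, 75, 90, 170, 802]
        }
    }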

Now, numeric sorts are useful, and way better than any comparison-based sorting algorithm, provided your data conforms to specific constraints - which it does most (but not all) of the time. By looking at the complexity given by any decent reference you can easily see what determines whether a numeric sort will do well: e.g. O(kN) implies that long numbers make it take longer, while something like whether it deals well with duplicates is a bit more subtle.
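
As an illustration of the "conforms to specific constraints" part: when you know the values fall in a small range, a counting sort (sketch below, assuming non-negative ints bounded by maxValue; the naming is mine) runs in O(n + maxValue), and duplicates are the easy case rather than the subtle one - they just bump the same counter.

    import java.util.Arrays;

    public class CountingSortSketch {
        // Counting sort for values known to lie in [0, maxValue]; runs in O(n + maxValue).
        static void countingSort(int[] a, int maxValue) {
            int[] count = new int[maxValue + 1];
            for (int v : a) count[v]++;              // duplicates just increment the same slot
            int i = 0;
            for (int v = 0; v <= maxValue; v++)
                for (int c = 0; c < count[v]; c++) a[i++] = v;
        }

        public static void main(String[] args) {
            int[] data = {5, 3, 8, 3, 1, 5, 5, 0};
            countingSort(data, 99);                      // the "number range 0-99" case from the test below
            System.out.println(Arrays.toString(data));   // [0, 1, 3, 3, 5, 5, 5, 8]
        }
    }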

So, why aren't they used?

Without extensive real-world experience / theoretical knowledge, you're unlikely to pick the most efficient algorithm. It's entirely possible that you'll find yourself with a problem where the chosen algorithm, which should be awesome in theory, severely under-performs a standard algorithm for your data because of some subtle factor.

So standard libraries don't put you in the position of picking an incorrect sort and getting terrible performance because your data doesn't conform to some constraint. Library sorts tend to be decent all-round, but aren't specialized to specific data sets. I'm sure there are also libraries that focus on sorting, allowing you to pick from an extensive range of algorithms, but your average Joe the programmer probably doesn't want to be / shouldn't be exposed to that choice.

Also note, while they aren't commonly included in libraries, it should be easy enough to find / write an implementation of whichever (popular) sort you wish to use ... which you should then benchmark against library sorts on a sufficient sample of your data before committing to it.
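
By way of a (very rough) sketch of such a benchmark in Java - plain System.nanoTime on copies of the same input, reusing the radixSort sketch from above as a stand-in for whichever implementation you found / wrote; a proper harness such as JMH would give more trustworthy numbers:

    import java.util.Arrays;
    import java.util.Random;

    public class SortBenchmarkSketch {
        public static void main(String[] args) {
            // Ideally use a sample of your real data; random ints in [0, 100) are just a stand-in here.
            Random rnd = new Random(42);
            int[] sample = rnd.ints(100_000, 0, 100).toArray();

            // Time the library sort on a copy.
            int[] copy = Arrays.copyOf(sample, sample.length);
            long t0 = System.nanoTime();
            Arrays.sort(copy);
            long libraryNanos = System.nanoTime() - t0;

            // Time the candidate sort on an identical copy.
            copy = Arrays.copyOf(sample, sample.length);
            t0 = System.nanoTime();
            RadixSortSketch.radixSort(copy);   // placeholder for your chosen implementation
            long customNanos = System.nanoTime() - t0;

            System.out.println("library: " + libraryNanos + " ns, custom: " + customNanos + " ns");
        }
    }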

A somewhat random test

This is by no means intended to be a conclusive, 100% correct test with the best implementations of radix sort and quick sort to ever see the light of day. It's more to show that what the data looks like plays a large role in the performance of any given algorithm.

This is the only decent benchmark including radix-sort I could find in a few minutes of searching.

I ran the code and found this: (number range 0-2147483646)

(the time unit is nanosecond-based, but the absolute values aren't meaningful as seconds - only the relative comparison is)

ArraySize Radix     Quick
10        1889      126       
100       2871      2702      
1000      18227     38075     
10000     360623    484128    
100000    2306284   6029230   

For a large range of numbers, quick-sort is only faster for arrays of up to about 100 elements (exactly what I was saying above). Interesting, but nothing really amazing about it - I mean, who cares about the performance of sorting fewer than 100 numbers?

However, look what happened when I changed the number range to 0-99:

ArraySize Radix     Quick
10        1937      121       
100       8932      2022      
1000      29513     14824     
10000     236669    125926    
100000    2393641   1225715   

Quick-sort is consistently around 2x faster than Radix-sort for reasonably-sized arrays (1000-100000 elements).

You must be thinking - "What in the world? I thought radix sort was supposed to be good at cases like this. I mean ... there are only 2 digits. And why is quick-sort so much faster than in the above case?" Exactly. That's where "extensive real-world experience / theoretical knowledge" comes in. I suspect it relates to how well each algorithm / implementation deals with duplicates. But half of that could be because I might not have optimized the radix sort implementation for the smaller range (didn't know you can do that? Well, that's another reason against trying to have a generic radix sort in a library).

Now 0-99 is probably not your typical data set either, and, overall, radix-sort is probably still better, but what you need to take away from all of this:

- There are about a gazillion sorting algorithms. They vary greatly in what they're good at. Don't expect a standard library to give you a function for each.
- Comparison-based sorts can sort any comparable data type (and are fast enough for most practical applications), as opposed to numeric sorts, which can only sort numbers.
- Thus having a single (or two, as Java has) comparison-based sort in your language's standard library is preferred.

Upvotes: 3

tmyklebu

Reputation: 14205

Basically, we use comparison-based sorting algorithms because it's easier. Being able to supply a comparison function and get your data sorted is a huge win from an engineering perspective, even if you pay for it with a speed hit.
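
For instance, in Java the whole engineering win is a single call: hand Arrays.sort a comparison function and any data type becomes sortable. A minimal sketch (using a record for brevity, so Java 16+; the type and field names are made up):

    import java.util.Arrays;
    import java.util.Comparator;

    public class ComparatorSketch {
        record Person(String name, int age) {}

        public static void main(String[] args) {
            Person[] people = {
                new Person("Carol", 35), new Person("Alice", 30), new Person("Bob", 30)
            };
            // One comparison function is all the sort needs to know about the type:
            // sort by age, breaking ties by name.
            Arrays.sort(people, Comparator.comparingInt(Person::age)
                                          .thenComparing(Person::name));
            System.out.println(Arrays.toString(people));
        }
    }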

Keep in mind that the O(n log n) comparison-based sorting bound counts comparisons, not total runtime. If you're sorting strings, for instance, comparison can take time linear in the lengths of the strings being compared.

A common misconception (that I see echoed in the other answer) is that comparison-based sorting winds up with better asymptotic complexity when you're sorting a moderate number of long numbers, say k bytes each. This simply isn't true; you do about n log(n) comparisons of numbers, each of which takes O(k) time, for an overall complexity of O(k n log n). This is worse than O(k n).

Engineering a fast radix sort is a little harder than the theory says. While theory dictates that you should choose as large a radix as possible, there is a tradeoff between the radix you choose and the locality you achieve when partitioning the input stream. A bigger radix means fewer passes but also less local use of memory.
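
To put rough numbers on that tradeoff (illustrative arithmetic only, not taken from any particular implementation): for 32-bit keys sorted b bits at a time, the pass count shrinks as the radix grows, but the counting array - and with it the working set touched per pass - grows exponentially.

    public class RadixWidthSketch {
        public static void main(String[] args) {
            // For 32-bit keys sorted b bits per pass:
            //   passes  = ceil(32 / b)   (fewer passes for a bigger radix)
            //   buckets = 2^b            (bigger counting array, worse locality)
            for (int bits = 4; bits <= 16; bits += 4) {
                int passes = (32 + bits - 1) / bits;
                int buckets = 1 << bits;
                System.out.printf("radix 2^%-2d -> %d passes, %d counting buckets%n",
                                  bits, passes, buckets);
            }
        }
    }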

Upvotes: 1
