user76284

Reputation: 1328

Sort and binary search or just linear search?

Problem

The time complexity of a selection sort is n*(n-1)/2. Given a list of 1000 items, how many worst-case linear searches are needed before it becomes faster to sort and use binary search?

Attempt at solution

The worst-case cost of a binary search is floor(log(n)/log(2)+1) comparisons, so the cost of one selection sort followed by x binary searches is n*(n-1)/2 + x*floor(log(n)/log(2)+1). A single linear search costs n in the worst case, so x linear searches cost x*n.

Thus we establish the equality x*n=n*(n-1)/2+x*floor(log(n)/log(2)+1), which for n=1000 yields x=5550/11 and hence floor(x)=floor(5550/11)=504. Is this correct?
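As a sanity check on the arithmetic, here is the same calculation as a throwaway Python script (the variable names are arbitrary, just for illustration):

    from fractions import Fraction
    from math import floor, log2

    n = 1000
    sort_cost = n * (n - 1) // 2       # selection sort: n*(n-1)/2 comparisons
    search_cost = floor(log2(n)) + 1   # binary search worst case: floor(log(n)/log(2)+1) = 10

    # Solve x*n = sort_cost + x*search_cost for x
    x = Fraction(sort_cost, n - search_cost)
    print(x, floor(x))                 # prints 5550/11 504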

Upvotes: 2

Views: 1275

Answers (1)

Jerry Coffin

Reputation: 490178

You're trying for a lot more precision than you can get from a simple count of operations. To get an exact number, you'd need to account for things like the processor cache (linear search probably uses it a little better).

To get a general idea, however, we can look at big-O complexity. The (expected) complexity of the sort is O(N log N). The complexity of a binary search is O(log N), so the complexity of M binary searches is O(M log N). Overall complexity of sorting then binary searching is approximately O((N+M) log N).

For linear search, the complexity of one search is O(N), so the complexity of M searches is O(NM).
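Just to make that model concrete, here's a quick sketch in Python (it counts abstract operations from the big-O expressions, not real time, and the function names are mine):

    from math import log2

    def linear_cost(n, m):
        # m linear searches over the unsorted list: O(N*M)
        return n * m

    def sort_then_search_cost(n, m):
        # one O(N log N) sort plus m O(log N) binary searches: O((N+M) log N)
        return (n + m) * log2(n)

    # At N = 1000 the two models cross somewhere around M = 10:
    print(linear_cost(1000, 10), sort_then_search_cost(1000, 10))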

So, for N = 1000, the break-even point should be approximately:

(1000+M) * log2(1000) = 1000 M.

That's approximately: (1000+M) * 10 = 1000 M.

Dividing both sides by 10 gives: 1000+M = 100M, and dividing again by 100: (1000+M)/100 = M

Distributing, we get: 1000/100 + M/100 = M

Subtracting M/100 from both sides and simplifying the constants gives: 10 = .99M

Dividing, we get essentially M = 10.

So, ignoring differences in constant factors, our break-even point should be around 10 searches.
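Solving the same model equation exactly rather than rounding step by step gives M = N*log2(N) / (N - log2(N)), which for N = 1000 comes out just over 10 (again, this is only the model, not measured time):

    from math import log2

    n = 1000
    # (N + M) * log2(N) = N * M  =>  M = N*log2(N) / (N - log2(N))
    m = n * log2(n) / (n - log2(n))
    print(m)   # about 10.07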

On a real CPU, I'd guess most of those constant factors (branch prediction, cache usage, etc.) tend to favor linear searching over binary searching to some degree: a linear search works nicely with the cache and its branches are highly predictable. So we'd expect linear search to remain faster than binary search for somewhat more searches than that, probably at least 20, and quite possibly as many as 50 or 100.
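If you want to see where it actually lands on a given machine, a rough benchmark along these lines is easy to throw together (Python here, with bisect standing in for binary search; the constant factors in a compiled language will look quite different, so treat the output as illustrative only):

    import bisect
    import random
    import timeit

    N, M = 1000, 20
    data = [random.randrange(10 * N) for _ in range(N)]
    queries = [random.choice(data) for _ in range(M)]

    def linear_searches():
        # M linear scans over the unsorted list
        return [data.index(q) for q in queries]

    def sort_then_binary():
        # sort once, then M binary searches
        s = sorted(data)
        return [bisect.bisect_left(s, q) for q in queries]

    print("linear:     ", timeit.timeit(linear_searches, number=1000))
    print("sort+binary:", timeit.timeit(sort_then_binary, number=1000))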

There's probably one other factor to consider though. For 1000 items, a Shell-Metzner sort, with roughly O(N^1.2) complexity, might very well end up somewhat faster than a Quicksort. In my experience, it often is up to around 1500-2000 items. That's, if anything, even more difficult to estimate with any real accuracy though, so I'm just going to guess that there's a good chance it'll reduce the break-even point a little bit.
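For reference, here's a plain gap-halving Shellsort sketch (Shell-Metzner is a variant of this basic scheme; the gap sequence is what drives that rough N^1.2 figure, so take the exponent as an empirical estimate rather than a proven bound):

    def shell_sort(a):
        # classic Shellsort: gapped insertion sort with the gap halved each pass
        gap = len(a) // 2
        while gap > 0:
            for i in range(gap, len(a)):
                tmp = a[i]
                j = i
                while j >= gap and a[j - gap] > tmp:
                    a[j] = a[j - gap]
                    j -= gap
                a[j] = tmp
            gap //= 2
        return a

    print(shell_sort([5, 2, 9, 1, 7, 3]))   # [1, 2, 3, 5, 7, 9]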

Upvotes: 4
