user4333994
user4333994

Reputation: 51

Performance of binary search algorithm when there are many duplicates

http://katemats.com/interview-questions/ says:

  • You are given a sorted array and you want to find the number N. How do you do the search as quickly as possible (not just traversing each element)?

    • How would the performance of your algorithm change if there were lots of duplicates in the array?

My answer to the first question is binary search, which is O(log(n)), where n is the number of elements in the array.

According to this answer, "we have a maximum of log_2(n-1) steps" in the worst case when "element K is not present in A and smaller than all elements in A".

I think the answer to the second question is it doesn't affect the performance. Is this correct?

Upvotes: 2

Views: 481

Answers (2)

Jason Nordwick
Jason Nordwick

Reputation: 1556

If you are talking worst case / big O, then you are correct - log(n) is your bound. However, if your data is fairly uniformly distributed (or you can map to that distribution), interpolating where to pick your partition can get log(log(n)) behavior. When you do the interpolation too, you also get rid of your worse cases where you have looking for one of the end elements (of course there are new pathological cases though).

For many many duplicates you might be willing to stride further away the direct center on the next probe. With more dups, you get a better margin for guessing correctly. While always choosing the half-way point gets you there in good time, educated guesses might get you some really excellent average behavior.

When I interview, I like to hear those answers, both knowledge of the book and what the theoretical is, but also what things can be done to specialize to the given situation. Often these constant factors can be really helpful (look at quicksort and its partition choosing schemes).

Upvotes: 2

Michael Dowd
Michael Dowd

Reputation: 231

I don't think having duplicates matters.

You're looking for a particular number N, what matters is whether or not the current node matches N.

If I'm looking for the number 1 in the list 1-2-3-4-5-6 the performance would be identical to searching the list 1-9-9-9-9-9.

If the number N is duplicated then you will have a chance of finding it a couple steps sooner. For example if the same search was done on the list 1-1-1-1-1-9.

Upvotes: 0

Related Questions