Jackson Tale

Reputation: 25812

algorithm - Sort an array with LogLogN distinct elements

This is not my school homework. This is my own homework; I am self-learning algorithms.

In The Algorithm Design Manual, there is this exercise:

4-25 Assume that the array A[1..n] only has numbers from {1, . . . , n^2} but that at most log log n of these numbers ever appear. Devise an algorithm that sorts A in substantially less than O(n log n).

I have two approaches:


The first approach:

Basically, I want to do a counting sort for this problem. I can first scan the whole array (O(N)) and put all distinct numbers into an array of size log log N (int[] K).

Then I apply counting sort. However, when setting up the counting array (int[] C), I don't need to set its size to N^2; instead, I set the size to log log N too.

But this way, when counting the frequency of each distinct number, I have to scan array K to find that element's index (O(N log log N) in total) and then update array C.
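For concreteness, here is a minimal Java sketch of this first approach (everything except the names K and C is my own invention). One small tweak: if K is sorted once up front, each lookup can use binary search, which already brings the counting step down from O(N log log N) to O(N log log log N):

    import java.util.Arrays;

    public class CountingSortFewDistinct {

        // Sorts A in place, assuming it holds only k distinct values (k = log log n here).
        static void sort(int[] A) {
            // Scan once, collecting the distinct values into K --
            // O(n * k) with the naive linear lookup described above.
            int[] K = new int[A.length];
            int k = 0;
            for (int x : A) {
                boolean seen = false;
                for (int i = 0; i < k; i++) {
                    if (K[i] == x) { seen = true; break; }
                }
                if (!seen) K[k++] = x;
            }
            int[] keys = Arrays.copyOf(K, k);
            Arrays.sort(keys);                      // k is tiny, so this is cheap

            // Count frequencies; binary search in the sorted keys replaces
            // the linear scan of K, i.e., O(log k) instead of O(k) per element.
            int[] C = new int[k];
            for (int x : A) C[Arrays.binarySearch(keys, x)]++;

            // Write the values back in sorted order.
            int pos = 0;
            for (int i = 0; i < k; i++) {
                Arrays.fill(A, pos, pos + C[i], keys[i]);
                pos += C[i];
            }
        }

        public static void main(String[] args) {
            int[] A = {2, 8, 1, 6, 7, 1, 6};
            sort(A);
            System.out.println(Arrays.toString(A)); // [1, 1, 2, 6, 6, 7, 8]
        }
    }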


The second approach:

Again, I have to scan the whole array to get a distinct-number array K of size log log N.

Then I just do a kind of quicksort, but the partitioning is based on the median of the K array (i.e., each time the pivot is an element of K), recursively.

I think this approach will be the best, with O(N log log log N): the recursion depth is the log of K's size, i.e., log(log log N), and each level costs O(N).
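Here is a rough sketch of that second approach (all names are mine; it assumes the distinct values have already been collected into a sorted array). The partition is done stably through a scratch buffer to keep the code simple:

    import java.util.Arrays;

    public class MedianOfKQuicksort {

        // Recursively partition A[lo..hi) around the median of the sorted
        // distinct values keys[kLo..kHi). The recursion depth is O(log k)
        // and each level costs O(n) in total, so O(n log k) overall.
        static void sort(int[] A, int lo, int hi, int[] keys, int kLo, int kHi) {
            if (kHi - kLo <= 1 || hi - lo <= 1) return; // one distinct value left: done
            int kMid = (kLo + kHi) / 2;
            int pivot = keys[kMid];

            // Stable partition: values < pivot first, the rest after.
            int[] buf = new int[hi - lo];
            int w = 0;
            for (int i = lo; i < hi; i++) if (A[i] <  pivot) buf[w++] = A[i];
            int mid = lo + w;
            for (int i = lo; i < hi; i++) if (A[i] >= pivot) buf[w++] = A[i];
            System.arraycopy(buf, 0, A, lo, buf.length);

            sort(A, lo, mid, keys, kLo, kMid);      // values in keys[kLo..kMid)
            sort(A, mid, hi, keys, kMid, kHi);      // values in keys[kMid..kHi)
        }

        public static void main(String[] args) {
            int[] A = {2, 8, 1, 6, 7, 1, 6};
            int[] keys = Arrays.stream(A).distinct().sorted().toArray();
            sort(A, 0, A.length, keys, 0, keys.length);
            System.out.println(Arrays.toString(A)); // [1, 1, 2, 6, 6, 7, 8]
        }
    }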


Am I right? Or are there better solutions?

Similar exercises exist in The Algorithm Design Manual, such as

4-22 Show that n positive integers in the range 1 to k can be sorted in O(n log k) time. The interesting case is when k << n.

4-23 We seek to sort a sequence S of n integers with many duplications, such that the number of distinct integers in S is O(log n). Give an O(n log log n) worst-case time algorithm to sort such sequences.

But basically for all these exercises, my intuition was always to think of counting sort, since we know the range of the elements and that range is small enough compared to the length of the whole array. But after thinking more deeply, I guess what the exercises are looking for is the second approach, right?

Thanks

Upvotes: 12

Views: 5218

Answers (4)

tranquil

Reputation: 499

We can just create a hash map storing each distinct element as a key and its frequency as the value.

Sort the map's k keys, where k is the number of distinct values (here k = log log n), in O(k log k) time using any sorting algorithm.

Now scan the map in sorted key order and write each element to the new array, frequency number of times.

total time = 2n + k log k = O(n) 
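A minimal Java sketch of this answer (the HashMap choice and all names are mine):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class HashMapCountSort {

        static void sort(int[] A) {
            // Pass 1, O(n): count each distinct value's frequency.
            Map<Integer, Integer> freq = new HashMap<>();
            for (int x : A) freq.merge(x, 1, Integer::sum);

            // Sort the k distinct keys: O(k log k), negligible for k = log log n.
            int[] keys = freq.keySet().stream().mapToInt(Integer::intValue).sorted().toArray();

            // Pass 2, O(n): write each key back, frequency times.
            int pos = 0;
            for (int key : keys) {
                int c = freq.get(key);
                Arrays.fill(A, pos, pos + c, key);
                pos += c;
            }
        }

        public static void main(String[] args) {
            int[] A = {2, 8, 1, 6, 7, 1, 6};
            sort(A);
            System.out.println(Arrays.toString(A)); // [1, 1, 2, 6, 6, 7, 8]
        }
    }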

Upvotes: 5

thb

Reputation: 14454

Update: After I wrote the answer below, @Nabb showed me why it was incorrect. For more information, see Wikipedia's brief entry on Õ, and the links therefrom. At least because it is still needed to lend context to @Nabb's and @Blueshift's comments, and because the whole discussion remains interesting, my original answer is retained, as follows.

ORIGINAL ANSWER (INCORRECT)

Let me offer an unconventional answer: though there is indeed a difference between O(n*n) and O(n), there is no difference between O(n) and O(n*log(n)).

Now, of course, we all know that what I just said is wrong, don't we? After all, various authors concur that O(n) and O(n*log(n)) differ.

Except that they don't differ.

So radical-seeming a position naturally demands justification, so consider the following, then make up your own mind.

Mathematically, essentially, the order m of a function f(z) is such that f(z)/(z^(m+epsilon)) converges while f(z)/(z^(m-epsilon)) diverges for z of large magnitude and real, positive epsilon of arbitrarily small magnitude. The z can be real or complex, though as we said epsilon must be real. With this understanding, apply L'Hospital's rule to a function of O(n*log(n)) to see that it does not differ in order from a function of O(n).

I would contend that the accepted computer-science literature at the present time is slightly mistaken on this point. This literature will eventually refine its position in the matter, but it hasn't done so yet.

Now, I do not expect you to agree with me today. This, after all, is merely an answer on Stackoverflow -- and what is that compared to an edited, formally peer-reviewed, published computer-science book -- not to mention a shelfful of such books? You should not agree with me today, only take what I have written under advisement, mull it over in your mind these coming weeks, consult one or two of the aforementioned computer-science books that take the other position, and make up your own mind.

Incidentally, a counterintuitive implication of this answer's position is that one can access a balanced binary tree in O(1) time. Again, we all know that that's false, right? It's supposed to be O(log(n)). But remember: the O() notation was never meant to give a precise measure of computational demands. Unless n is very large, other factors can be more important than a function's order. But, even for n = 1 million, log(n) is only 20, compared, say, to sqrt(n), which is 1000. And I could go on in this vein.

Anyway, give it some thought. Even if, eventually, you decide that you disagree with me, you may find the position interesting nonetheless. For my part, I am not sure how useful the O() notation really is when it comes to O(log something).

@Blueshift asks some interesting questions and raises some valid points in the comments below. I recommend that you read his words.

I don't really have a lot to add to what he has to say, except to observe that, because few programmers have (or need) a solid grounding in the mathematical theory of the complex variable, the O(log(n)) notation has misled probably, literally hundreds of thousands of programmers to believe that they were achieving mostly illusory gains in computational efficiency. Seldom in practice does reducing O(n*log(n)) to O(n) really buy you what you might think that it buys you, unless you have a clear mental image of how incredibly slow a function the logarithm truly is -- whereas reducing O(n) even to O(sqrt(n)) can buy you a lot. A mathematician would have told the computer scientist this decades ago, but the computer scientist wasn't listening, was in a hurry, or didn't understand the point.

And that's all right. I don't mind. There are lots and lots of points on other subjects I don't understand, even when the points are carefully explained to me. But this is a point I believe that I do happen to understand. Fundamentally, it is a mathematical point not a computer point, and it is a point on which I happen to side with Lebedev and the mathematicians rather than with Knuth and the computer scientists. This is all.

Upvotes: 0

blueshift

Reputation: 6882

I'm going to betray my limited knowledge of algorithmic complexity here, but:

Wouldn't it make sense to scan the array once and build something like a self-balancing tree? Since we know the number of nodes in the tree will only grow to log log n, it is relatively cheap (?) to find a number each time. If a repeat number is found (likely), a counter in that node is incremented. Then, to construct the sorted array, read the tree in order.

Maybe someone can comment on the complexity of this and any flaws.
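On the complexity: the tree never holds more than log log n nodes, so each of the n lookups costs O(log log log n), for O(n log log log n) overall -- the same bound as the question's second approach. A minimal Java sketch of the idea, using TreeMap (a red-black tree) as the self-balancing tree; the class choice and names are mine:

    import java.util.Map;
    import java.util.TreeMap;

    public class TreeCountSort {

        static void sort(int[] A) {
            // The tree holds at most log log n nodes, so each of the n
            // merge() calls (insert or increment) is O(log log log n).
            TreeMap<Integer, Integer> counts = new TreeMap<>();
            for (int x : A) counts.merge(x, 1, Integer::sum);

            // An in-order walk yields the distinct values in sorted order.
            int pos = 0;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                for (int i = 0; i < e.getValue(); i++) A[pos++] = e.getKey();
            }
        }
    }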

Upvotes: 0

Jarosław Gomułka

Reputation: 4995

Counting sort is one possible way:

  1. I will demonstrate this solution on the example 2, 8, 1, 6, 7, 1, 6, where all numbers are <= 3^2 = 9 (so N = 3). I use more elements than N to make the idea clearer.
  2. First, for each number A[i], compute A[i] / N. Let's call this number first_part_of_number.
  3. Sort this array using counting sort by first_part_of_number.
  4. The results are of the form (example for N = 3):

    (0, 2)
    (0, 1)
    (0, 1)
    (2, 8)
    (2, 6)
    (2, 7)
    (2, 6)

  5. Divide them into groups by first_part_of_number.

  6. In this example you will have groups
    (0, 2) (0, 1) (0, 1)

    and

    (2, 8) (2, 6) (2, 7) (2, 6)

  7. For each number, compute A[i] modulo N. Let's call it second_part_of_number. Append this number to each element:
    (0, 2, 2) (0, 1, 1) (0, 1, 1)

    and

    (2, 8, 2) (2, 6, 0) (2, 7, 1) (2, 6, 0)

  8. Sort each group using counting sort by second_part_of_number

    (0, 1, 1) (0, 1, 1) (0, 2, 2)

    and

    (2, 6, 0) (2, 6, 0) (2, 7, 1) (2, 8, 2)

  9. Now combine all the groups and you have the result: 1, 1, 2, 6, 6, 7, 8.

Complexity: counting sort is only ever used on keys <= N, and each element takes part in exactly 2 such sorts, so the overall complexity is O(N).
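A compact Java sketch of this scheme (all names are mine). It runs the two stable counting-sort passes least-significant digit first (LSD radix sort in base N), which yields the same result as the grouped, most-significant-first walkthrough above; it assumes values in [0, N^2) -- for values in 1..N^2, subtract 1 first:

    import java.util.Arrays;

    public class TwoDigitRadixSort {

        static int digit(int x, int n, boolean high) {
            return high ? x / n : x % n;  // first or second part of the number
        }

        // Stable counting sort of A by one base-n digit: O(n).
        static int[] countingSortByDigit(int[] A, int n, boolean high) {
            int[] count = new int[n];
            for (int x : A) count[digit(x, n, high)]++;
            for (int i = 1; i < n; i++) count[i] += count[i - 1]; // prefix sums = end positions
            int[] out = new int[A.length];
            for (int i = A.length - 1; i >= 0; i--)               // backwards keeps it stable
                out[--count[digit(A[i], n, high)]] = A[i];
            return out;
        }

        // Values below n^2 have two base-n digits, so two stable passes sort them.
        static int[] sort(int[] A, int n) {
            int[] byLow = countingSortByDigit(A, n, false); // by second_part_of_number
            return countingSortByDigit(byLow, n, true);     // by first_part_of_number
        }

        public static void main(String[] args) {
            int[] A = {2, 8, 1, 6, 7, 1, 6};
            System.out.println(Arrays.toString(sort(A, 3))); // [1, 1, 2, 6, 6, 7, 8]
        }
    }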

Upvotes: 0
