Abhijit Sarkar
Abhijit Sarkar

Reputation: 24578

How to find all values that occur more than n/k times in an array?

I'm pursuing the Algorithms, Part I course on Coursera, and one of the interview questions (ungraded) is as follows:

Decimal dominants. Given an array with n keys, design an algorithm to find all values that occur more than n/10 times. The expected running time of your algorithm should be linear.

It has a hint:

determine the (n/10)th largest key using quickselect and check if it occurs more than n/10 times.

I don't understand what does the n/10 largest key have to do with n/10 repeated values. It won't tell me which values occur more than n/10 times.

There's a paper that finds a more general solution for n/k, but I'm having a hard time understanding the code in the paper.

One way to solve it is to sort the input array, and then make another pass counting the occurrence of each distinct value. That'll take O(nlogn) + O(n) time, which is more than what the question asks for.

Ideas?

Upvotes: 5

Views: 6105

Answers (5)

Vladimir
Vladimir

Reputation: 43

I decided this question with Dijkstra's approach to solve DNF problem. Please see my emplementation below. May be it will be interest and useful to somebody.

public void dCount(Comparable[] arr) {
    dCount(arr, 0, arr.length - 1);
}
private void dCount(Comparable[] arr, int lo, int hi) {
    if (lo >= hi) return;
    int curr = lo;
    int lt = lo;
    int rt = hi;
    Comparable pivot = arr[lo];

    while (curr <= rt) { // 3-way qSort main cycle
        if (less(arr[curr], pivot))
            swap(arr, curr++, lt++);
        else if (less(pivot, arr[curr]))
            swap(arr, curr, rt--);
        else curr++;
    }
    int count = curr - lt;
    if (count > arr.length / 10) { // you can change count value if it needs (n / 3, n / 2... just change 10 to your number)           
        System.out.printf("%s repeats %d times\n", arr[lt], count);
    }
    dCount(arr, lo, lt -1); // recur call for the left side from the pivot equal range
    dCount(arr, rt + 1, hi); // recur call for the right side from the pivot equal range
}
private static boolean less(Comparable a, Comparable b) {
    return a.compareTo(b) < 0;
}
private static void swap(Comparable[] arr, int i, int j) {
    Comparable swap = arr[i];
    arr[i] = arr[j];
    arr[j] = swap;
}

Upvotes: -1

serg
serg

Reputation: 1023

To say the truth I haven't got any of answers here, but they gave an idea how to solve the problem. Based on the answer by @MBo as was mentioned, you use quick selection for an every 10th element, but you use not a simple quick selection, but the quick selection based on 3-way quick sort. You not just place every 10th element on its place, but also all elements that are equal to it. And after one iteration is enough to count all elements that have more than 9 duplicates.

Upvotes: -1

thebenman
thebenman

Reputation: 1621

There is a variation of Boyer-Moore Voting algorithm which can find all the elements that occurs more than n/k in a input which runs in O(nk) and since k = 10 for your problem I think it should run in O(n * 10) = O(n) time.

From here

Following is an interesting O(nk) solution: We can solve the above problem in O(nk) time using O(k-1) extra space. Note that there can never be more than k-1 elements in output (Why?). There are mainly three steps in this algorithm.

1) Create a temporary array of size (k-1) to store elements and their counts (The output elements are going to be among these k-1 elements). Following is structure of temporary array elements.

 struct eleCount {
     int element;
     int count; };  
 struct eleCount temp[];  This step takes O(k) time.

2) Traverse through the input array and update temp[] (add/remove an element or increase/decrease count) for every traversed element. The array temp[] stores potential (k-1) candidates at every step. This step takes O(nk) time.

3) Iterate through final (k-1) potential candidates (stored in temp[]). or every element, check if it actually has count more than n/k. This step takes O(nk) time.

The main step is step 2, how to maintain (k-1) potential candidates at every point? The steps used in step 2 are like famous game: Tetris. We treat each number as a piece in Tetris, which falls down in our temporary array temp[]. Our task is to try to keep the same number stacked on the same column (count in temporary array is incremented).

Consider k = 4, n = 9 Given array: 3 1 2 2 2 1 4 3 3

i = 0

     3 _ _ temp[] has one element, 3 with count 1

i = 1

     3 1 _ temp[] has two elements, 3 and 1 with  counts 1 and 1 respectively

i = 2

     3 1 2 temp[] has three elements, 3, 1 and 2 with counts as 1, 1 and 1 respectively.

i = 3

     - - 2 
     3 1 2 temp[] has three elements, 3, 1 and 2 with counts as 1, 1 and 2 respectively.

i = 4

     - - 2 
     - - 2 
     3 1 2 temp[] has three elements, 3, 1 and 2 with counts as 1, 1 and 3 respectively.

i = 5

     - - 2 
     - 1 2 
     3 1 2 temp[] has three elements, 3, 1 and 2 with counts as 1, 2 and 3 respectively.  

Now the question arises, what to do when temp[] is full and we see a new element – we remove the bottom row from stacks of elements, i.e., we decrease count of every element by 1 in temp[]. We ignore the current element.

i = 6

     - - 2 
     - 1 2  temp[] has two elements, 1 and 2 with counts as 1 and 2 respectively.

i = 7

       - 2 
     3 1 2  temp[] has three elements, 3, 1 and 2 with counts as 1, 1 and 2 respectively.

i = 8

     3 - 2
     3 1 2  temp[] has three elements, 3, 1 and 2 with counts as 2, 1 and 2 respectively. 

Finally, we have at most k-1 numbers in temp[]. The elements in temp are {3, 1, 2}. Note that the counts in temp[] are useless now, the counts were needed only in step 2. Now we need to check whether the actual counts of elements in temp[] are more than n/k (9/4) or not. The elements 3 and 2 have counts more than 9/4. So we print 3 and 2.

For a proper proof of this approach check out this answer from cs.stackexchange

Upvotes: 4

MBo
MBo

Reputation: 80197

You are right.

Decimal dominant is among the next candidates after QuickSelect: n/10, 2*n/10..9*n/10, so checking only n/10 index is not sufficient

Note that dominant occupies long run in sorted array and certainly at least one of elements with mentioned indexes belongs to that run.

Example for k = 3, N = 11. Let element b occupies at least 1/3 of array. In this case sorted array might look like

b b b b * * * * * * * 
* b b b b * * * * * * 
* * b b b b * * * * *     
* * * b b b b * * * *     
* * * * b b b b * * *
* * * * * b b b b * *
* * * * * b b b b * *
* * * * * * b b b b *
* * * * * * * b b b b
      ^       ^       //positions for quickselect

Note that in any case dominant element (if k-dominant does exist) occupies at least one of marked places. So after two rounds of QuickSelect we have two candidates

Upvotes: 1

mcdowella
mcdowella

Reputation: 19601

Finding the n/10th largest key (that is, the key that would be at position n/10 if the array was sorted) takes linear time using QuickSelect. If there are less than n/10 copies of this key, then you know that there are not n/10 copies of anything above it in sorted order, because there isn't room for n/10 copies of anything above the key in sorted order. If there are n/10 or more copies, then you have found something that occurs more than n/10 times, and again there can't be anything larger than it that occurs more than n/10 times, because there isn't room for it.

Now you have an array of at most 9n/10 values smaller than the key you have just found left over from QuickSelect. Use another pass of QuickSelect to find the key n/10 from the top of this left over array. As before, you may find a key that occurs n/10 or more times, and whether you do or not you will eliminate at least n/10 entries from the array.

So you can search the whole array with 10 calls of QuickSelect, each taking linear time. 10 is a number fixed in the problem definition, so the whole operation counts as only linear time.

Upvotes: 6

Related Questions