Spandan

Reputation: 2138

Disperse Duplicates in an Array

Source : Google Interview Question

Write a routine to ensure that identical elements in the input are maximally spread in the output.

Basically, we need to place identical elements in such a way that the TOTAL spread is as large as possible.

Example:

Input: {1,1,2,3,2,3}

Possible Output: {1,2,3,1,2,3}  

Total dispersion = difference between the positions of the 1's + the 2's + the 3's = (4-1) + (5-2) + (6-3) = 9.
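
The scoring function in this example can be checked with a short helper (a sketch; the function name `dispersion` is mine, not part of the question):

```python
def dispersion(arr):
    # total dispersion = sum over each value of
    # (position of its last copy) - (position of its first copy)
    first, last = {}, {}
    for i, v in enumerate(arr):
        first.setdefault(v, i)  # record only the first occurrence
        last[v] = i             # keep overwriting: ends up at the last occurrence
    return sum(last[v] - first[v] for v in first)

print(dispersion([1, 2, 3, 1, 2, 3]))  # 9
```

The result is the same with 0-based or 1-based positions, since only differences of positions are summed.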

I am NOT AT ALL sure whether there's an optimal polynomial-time algorithm for this. No other detail was provided with the question.

What I thought was: calculate the frequency of each element in the input, then arrange them in the output one distinct element at a time, until all the frequencies are exhausted.

I am not sure about my approach.

Any approaches/ideas?

Upvotes: 11

Views: 529

Answers (4)

j_random_hacker

Reputation: 51246

HugoRune's answer takes some advantage of the unusual scoring function, but we can actually do even better. Suppose there are d distinct non-unique values. Then the only thing required for a solution to be optimal is that the first d values in the output consist of these d values in some order, and likewise the last d values in the output consist of these values in some (possibly different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)

The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.

Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged! This holds for the d elements at each end of the output array.

When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).

This leads to a simple O(n log n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X:

  1. Sort the input array X.
  2. Use a single pass through X to count the number of distinct non-unique elements. Call this d.
  3. Set i, j and k to 0.
  4. While i < n:
    • If i+1 < n and X[i+1] == X[i], we have a non-unique element:
      • Set Y[j] = Y[n-j-1] = X[i].
      • Increment i twice, and increment j once.
      • While i < n and X[i] == X[i-1]:
        • Set Y[d+k] = X[i].
        • Increment i and k.
    • Otherwise we have a unique element:
      • Set Y[d+k] = X[i].
      • Increment i and k.

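For reference, here is a direct Python translation of the steps above (a sketch; the function name `disperse` and the explicit bounds checks `i + 1 < n` / `i < n` are my additions):

```python
def disperse(X):
    """Arrange X so that the summed first-to-last distance per value is maximal."""
    X = sorted(X)                          # step 1: sort
    n = len(X)
    # step 2: count distinct non-unique values (first pair of each run of equals)
    d = sum(1 for i in range(1, n)
            if X[i] == X[i - 1] and (i < 2 or X[i] != X[i - 2]))
    Y = [None] * n
    i = j = k = 0                          # step 3
    while i < n:                           # step 4
        if i + 1 < n and X[i + 1] == X[i]:
            # non-unique: first copy goes in the left block, last in the right
            Y[j] = Y[n - j - 1] = X[i]
            i += 2
            j += 1
            # any further copies of the same value are "central" elements
            while i < n and X[i] == X[i - 1]:
                Y[d + k] = X[i]
                i += 1
                k += 1
        else:
            # unique elements are "central" too
            Y[d + k] = X[i]
            i += 1
            k += 1
    return Y

print(disperse([1, 1, 2, 3, 2, 3]))  # [1, 2, 3, 3, 2, 1], total dispersion 9
```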
Upvotes: 0

roman

Reputation: 117485

python code for algorithm suggested by Vorsprung and HugoRune:

from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter(data)

    res, num = [], list(cnt)
    while len(cnt) > 0:
        for i in num:
            # cnt is a Counter, so a deleted key reads back as 0
            if cnt[i] > 0:
                res.append(i)
                cnt[i] -= 1
                if cnt[i] == 0: del cnt[i]

    return res

def calc_spread(data):
    d = defaultdict(list)
    for i, v in enumerate(data):
        d[v].append(i)

    return sum(max(x) - min(x) for x in d.values())

Upvotes: 0

HugoRune

Reputation: 13809

I believe this simple algorithm would work:

  • count the number of occurrences of each distinct element.
  • make a new list
  • add one instance of all elements that occur more than once to the list (order within each group does not matter)
  • add one instance of all unique elements to the list
  • add one instance of all elements that occur more than once to the list
  • add one instance of all elements that occur more than twice to the list
  • add one instance of all elements that occur more than thrice to the list
  • ...

Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
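The bullet list above can be sketched in Python (a sketch, not the answerer's code; it relies on `Counter` preserving insertion order, i.e. Python 3.7+, and the name `spread` is illustrative):

```python
from collections import Counter

def spread(data):
    cnt = Counter(data)
    # one copy of every element that occurs more than once, then all unique elements
    res = [v for v in cnt if cnt[v] > 1] + [v for v in cnt if cnt[v] == 1]
    # then another copy of everything occurring more than once, more than twice, ...
    k = 1
    while any(c > k for c in cnt.values()):
        res += [v for v in cnt if cnt[v] > k]
        k += 1
    return res

print(spread([1, 1, 1, 1, 2, 3, 4]))     # [1, 2, 3, 4, 1, 1, 1]
print(spread([1, 1, 1, 2, 2, 2, 3, 4]))  # [1, 2, 3, 4, 1, 2, 1, 2]
```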

However, I think this is the best spread you can get given the scoring function provided. Since the dispersion score counts the sum of the distances instead of the squared sum of the distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.

For a sum-of-squared-distances score, the problem becomes harder. Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?

Upvotes: 4

Vorsprung

Reputation: 34367

In perl

@a=(9,9,9,2,2,2,1,1,1);

then make a hash table of the counts of different numbers in the list, like a frequency table

map { $x{$_}++ } @a;

then repeatedly walk through all the keys found (in a known order), adding one copy of each number that still has count remaining to an output list, until all the counts are exhausted

@r=();
$g=1; 
while( $g == 1 ) { 
   $g=0;
   for my $n (sort keys %x) 
      {
      if ($x{$n}>0) {
                    push @r, $n;
                    $x{$n}--;
                    $g=1
                    }
      } 
}

I'm sure that this could be adapted to any programming language that supports hash tables.

Upvotes: 1
