Reputation: 11

Programmatically Split A Sequence of Numbers Into 4 Relatively Equal Chunks

How would I go about splitting a sequence of numbers into 4 equal (as equal as possible) chunks?

If I have a sequence of integers as such:

16, 4, 17, 10, 15, 4, 4, 6, 7, 14, 9, 17, 27, 6, 1, 9, 0, 12, 20, 8, 0, 3, 4, 0, 3, 4

I want to split that sequence into 4 chunks where the summation of each chunk is as close to the quarter of the summation of the sequence as possible. The total value of the sequence is 220 so I would want chunks roughly equal to 55. The sequence is as such and it's order should not be changed.

Background: The numbers are representative of the number of entries in a phonebook that start with a certain letter. I'm trying to split the phonebook up in the best possible way.

Upvotes: 1

Answers (4)

Petr

Reputation: 9997

First of all, you should decide what is the exact value you want to minimize.

Let's denote S the total sum of the numbers, and s1, s2, s3 and s4 the sums of the four parts in some solution.

We can define many exact representations of a rather vague term "as equal as possible". That is, should max(s1,s2,s3,s4)-min(s1,s2,s3,s4) be as minimal as possible? Or should max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|) be as minimal as possible? Or, say, |s1-S/4|+|s2-S/4|+|s3-S/4|+|s4-S/4|? Etc.

I can think of a simple solution for the second metric: max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|) to be minimized.

Firstly, let's solve a different problem. Given your sequence and some value X, can we partition it in such a way that max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|)<=X? If we can solve this problem for arbitrary X, then the initial problem is solved by binary search on X.

So, how we check whether there exist a partition with max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|)<=X? This requirement is equivalent to the requirement S/4-X<=s[i]<=S/4+X, so for each chunk we know the allowed minimal and maximal sum.

Now go from the beginning calculating the current total sum and mark the positions where the first chunk can end — this will be positions where the sum starting from beginning is from S/4-X to S/4+X.

Now find where the second chunk can end. This is somewhat more tricky. The simplest approach would be to start from each found end position of first chunk and find corresponding possible end positions of the second chunk. But there exist a faster approach. First, start from the first possible end position of the first chunk, and calculate the corresponding end positions of the second chunk. Then, move to the second possible end position of the first chunk. Note that this will only add some new end positions for the second chunk that are to the right of already found positions, so no need to reiterate it all; if you keep the cumulative sum of span covered by 'current' second chunk, then you can find all possible positions of the second chunk in O(N). So you mark all possible end positions of the second chunk.

Similarly find possible end positions of the third chunk, and of the fourth. If the end of array is among the possible end positions of the fourth chunk, then such a division can be done, otherwise no. The division itself can be restored in a simple way, I won't describe it.

Code it like this:

func check(a,S,X) // a is given array
    // canEnd[i,j] is whether the i-th chunk can end just before position j :
    //  canEnd[i,j]==0 --- can not end
    //  canEnd[i,j]==1 --- can end
    //  cadEnd[i,j]==2 --- can end and this is the final possible position
    fill canEnd with zeroes
    canEnd[0,0] = 2
    l = 0  // left end of 'current' chunk
    r = 0  // right end of 'current' chunk (not inclusive)
    curs = 0 // sum of the 'current' chunk
    for i = 1..4
        while true
            last = -1
            while curs <= S/4+X
                if curS > S/4-X
                     canEnd[i,r] = 1
                     last = r
                s +=a[r] 
                r++
            // now processed all chunks that start at l
            if canEnd[i-1,l] == 2
                canEnd[i,last] = 2
                break
            do
                s -= a[l]
                l++
            until canEnd[i-1,l]>0

// main code
left = -1
right = S
while right - left > 1
    middle = (right + left) /2
    if can(middle)
        right = middle
    else left = middle
// the answer is right

(Note that I did not test the code and most probably it contains mistakes, it is here for illustration purposes only.)

For a max(s1,s2,s3,s4)-min(s1,s2,s3,s4) metric a similar approach can be applied, but you will have to iterate first from 0 to S/4 to try each possible value of min(s1,s2,s3,s4). For each possible value of min(s1,s2,s3,s4), do binary search on maximal possible value and you once again have defined range for each s[i].

Upvotes: 1

rossum

Reputation: 15685

If you want four chunks, preserving order, then you have three chunk-boundaries to place. I would start by placing the boundaries evenly, and then iterate shifting each of them by +/-1 to look for improvements. Either backtracking or a genetic algorithm should work. With a list as short as you have, there are not a huge number of different possibilities to try so it should run reasonably quickly.

ETA: possible Pseudocode:

place three boundaries in initial positions
calculate sizes of each chunk between boundaries
boundariesMoved <- true
WHILE (boundariesMoved) DO
  boundariesMoved <- false
  FOR EACH boundary
    check sizes of two adjacent chunks
    test moving boundary 1 step towards larger adjacent chunk
    IF move increased absolute difference between chunks THEN
      leave boundary in original position
    ELSE
      move boundary
      update sizes of affected chunks
      boundariesMoved <- true
    ENDIF
  ENDFOR
ENDWHILE

Upvotes: 1

Margus

Reputation: 20038

This is known as Same size K-Means problem. Usually it refers to 2-d variant where you have simpler - only one dimension case.

The basic idea of the algorithm is as follows:

Initialization:

Compute the desired cluster size, n/k.

Initialize means, preferrably with k-means++

Order points by the distance to their nearest cluster minus distance to the farthest cluster (= biggest benefit of best over worst assignment)

Assign points to their preferred cluster until this cluster is full, then resort remaining objects, without taking the full cluster into account anymore This initialization is not optimal - feel free to contribute an improvement to this tutorial! - in particular for the last cluster. But it will do as an initialization method.

Iteration:

Compute current cluster means

For each object, compute the distances to the cluster means

Sort elements based on the delta of the current assignment and the best possible alternate assignment.

For each element by priority:

For each other cluster, by element gain, unless already moved:

If there is an element wanting to leave the other cluster and this swap yields and improvement, swap the two elements

If the element can be moved without violating size constraints, move it

If the element was not changed, add to outgoing transfer list.

If no more transfers were done (or max iteration threshold was reached), terminate

Source : http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans

Upvotes: 0

Diego Martinoia

Reputation: 4652

If you want something that is NOT optimal, but is easy, fast and good enough (given the distribution is not crazy skewed), I'd suggest you do something like this:

Compute the sum
Divide it by N (how many shards you want)
Greedily take the largest value you can for each partition until your partition sum <= Sum/N. Put all the remaining values in the last partition.

You'll have N-1 partitions that are <= K, and one that will be >= K (K=Sum/N). It's easier than the actual partition problem, and not correct, but given your context this seems acceptable, especially because usually the latter values (which match to letters such as W X Y Z) will have smaller values.

Upvotes: 0

Programmatically Split A Sequence of Numbers Into 4 Relatively Equal Chunks

Answers (4)

Initialization:

Iteration:

Related Questions