Reputation: 11
How would I go about splitting a sequence of numbers into 4 equal (as equal as possible) chunks?
If I have a sequence of integers as such:
16, 4, 17, 10, 15, 4, 4, 6, 7, 14, 9, 17, 27, 6, 1, 9, 0, 12, 20, 8, 0, 3, 4, 0, 3, 4
I want to split that sequence into 4 chunks where the summation of each chunk is as close to the quarter of the summation of the sequence as possible. The total value of the sequence is 220 so I would want chunks roughly equal to 55. The sequence is as such and it's order should not be changed.
Background: The numbers are representative of the number of entries in a phonebook that start with a certain letter. I'm trying to split the phonebook up in the best possible way.
Upvotes: 1
Views: 1046
Reputation: 9997
First of all, you should decide what is the exact value you want to minimize.
Let's denote S
the total sum of the numbers, and s1
, s2
, s3
and s4
the sums of the four parts in some solution.
We can define many exact representations of a rather vague term "as equal as possible". That is, should max(s1,s2,s3,s4)-min(s1,s2,s3,s4)
be as minimal as possible? Or should max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|)
be as minimal as possible? Or, say, |s1-S/4|+|s2-S/4|+|s3-S/4|+|s4-S/4|
? Etc.
I can think of a simple solution for the second metric: max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|)
to be minimized.
Firstly, let's solve a different problem. Given your sequence and some value X, can we partition it in such a way that max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|)<=X
? If we can solve this problem for arbitrary X
, then the initial problem is solved by binary search on X
.
So, how we check whether there exist a partition with max(|s1-S/4|, |s2-S/4|, |s3-S/4|, |s4-S/4|)<=X
? This requirement is equivalent to the requirement S/4-X<=s[i]<=S/4+X
, so for each chunk we know the allowed minimal and maximal sum.
Now go from the beginning calculating the current total sum and mark the positions where the first chunk can end — this will be positions where the sum starting from beginning is from S/4-X
to S/4+X
.
Now find where the second chunk can end. This is somewhat more tricky. The simplest approach would be to start from each found end position of first chunk and find corresponding possible end positions of the second chunk. But there exist a faster approach. First, start from the first possible end position of the first chunk, and calculate the corresponding end positions of the second chunk. Then, move to the second possible end position of the first chunk. Note that this will only add some new end positions for the second chunk that are to the right of already found positions, so no need to reiterate it all; if you keep the cumulative sum of span covered by 'current' second chunk, then you can find all possible positions of the second chunk in O(N)
. So you mark all possible end positions of the second chunk.
Similarly find possible end positions of the third chunk, and of the fourth. If the end of array is among the possible end positions of the fourth chunk, then such a division can be done, otherwise no. The division itself can be restored in a simple way, I won't describe it.
Code it like this:
func check(a,S,X) // a is given array
// canEnd[i,j] is whether the i-th chunk can end just before position j :
// canEnd[i,j]==0 --- can not end
// canEnd[i,j]==1 --- can end
// cadEnd[i,j]==2 --- can end and this is the final possible position
fill canEnd with zeroes
canEnd[0,0] = 2
l = 0 // left end of 'current' chunk
r = 0 // right end of 'current' chunk (not inclusive)
curs = 0 // sum of the 'current' chunk
for i = 1..4
while true
last = -1
while curs <= S/4+X
if curS > S/4-X
canEnd[i,r] = 1
last = r
s +=a[r]
r++
// now processed all chunks that start at l
if canEnd[i-1,l] == 2
canEnd[i,last] = 2
break
do
s -= a[l]
l++
until canEnd[i-1,l]>0
// main code
left = -1
right = S
while right - left > 1
middle = (right + left) /2
if can(middle)
right = middle
else left = middle
// the answer is right
(Note that I did not test the code and most probably it contains mistakes, it is here for illustration purposes only.)
For a max(s1,s2,s3,s4)-min(s1,s2,s3,s4)
metric a similar approach can be applied, but you will have to iterate first from 0 to S/4
to try each possible value of min(s1,s2,s3,s4)
. For each possible value of min(s1,s2,s3,s4)
, do binary search on maximal possible value and you once again have defined range for each s[i]
.
Upvotes: 1
Reputation: 15685
If you want four chunks, preserving order, then you have three chunk-boundaries to place. I would start by placing the boundaries evenly, and then iterate shifting each of them by +/-1 to look for improvements. Either backtracking or a genetic algorithm should work. With a list as short as you have, there are not a huge number of different possibilities to try so it should run reasonably quickly.
ETA: possible Pseudocode:
place three boundaries in initial positions
calculate sizes of each chunk between boundaries
boundariesMoved <- true
WHILE (boundariesMoved) DO
boundariesMoved <- false
FOR EACH boundary
check sizes of two adjacent chunks
test moving boundary 1 step towards larger adjacent chunk
IF move increased absolute difference between chunks THEN
leave boundary in original position
ELSE
move boundary
update sizes of affected chunks
boundariesMoved <- true
ENDIF
ENDFOR
ENDWHILE
Upvotes: 1
Reputation: 20038
This is known as Same size K-Means problem. Usually it refers to 2-d variant where you have simpler - only one dimension case.
The basic idea of the algorithm is as follows:
Initialization:
- Compute the desired cluster size, n/k.
- Initialize means, preferrably with k-means++
- Order points by the distance to their nearest cluster minus distance to the farthest cluster (= biggest benefit of best over worst assignment)
- Assign points to their preferred cluster until this cluster is full, then resort remaining objects, without taking the full cluster into account anymore This initialization is not optimal - feel free to contribute an improvement to this tutorial! - in particular for the last cluster. But it will do as an initialization method.
Iteration:
- Compute current cluster means
- For each object, compute the distances to the cluster means
- Sort elements based on the delta of the current assignment and the best possible alternate assignment.
- For each element by priority:
- For each other cluster, by element gain, unless already moved:
- If there is an element wanting to leave the other cluster and this swap yields and improvement, swap the two elements
- If the element can be moved without violating size constraints, move it
- If the element was not changed, add to outgoing transfer list.
- If no more transfers were done (or max iteration threshold was reached), terminate
Source : http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Upvotes: 0
Reputation: 4652
If you want something that is NOT optimal, but is easy, fast and good enough (given the distribution is not crazy skewed), I'd suggest you do something like this:
You'll have N-1 partitions that are <= K, and one that will be >= K (K=Sum/N). It's easier than the actual partition problem, and not correct, but given your context this seems acceptable, especially because usually the latter values (which match to letters such as W X Y Z) will have smaller values.
Upvotes: 0