Reputation: 840
I have an ordered list of weighted items, weight of each is less-or-equal than N. I need to convert it into a list of clusters. Each cluster should span several consecutive items, and total weight of a cluster has to be less-or-equal than N.
Is there an algorithm which does it while minimizing the total number of clusters and keeping their weights as even as possible?
E.g. list [(a,5),(b,1),(c,2),(d,5)], N=6 should be converted into [([a],5),([b,c],3),([d],5)]
Upvotes: 3
Views: 314
Reputation: 1213
Since the dataset is ordered, one possible approach is to assign a "badness" score to each possible cluster and use a dynamic program reminiscent of Knuth's word wrapping ( http://en.wikipedia.org/wiki/Word_wrap ) to minimize the sum of the badness scores. The badness function will let you explore tradeoffs between minimizing the number of clusters (larger constant term) and balancing them (larger penalty for deviating from the average number of items).
Upvotes: 2
Reputation: 76730
Your problem is under-specified.
The issue is that you are trying to optimize two different properties of the resulting data, and these properties may be in opposition to one another. For a given set of data, it may be that the most even distribution has many clusters, and that the smallest number of clusters has a very uneven distribution.
For example, consider: [(a,1),(b,1),(c,1),(d,1),(e,1)], N=2
The most even distribution is [([a],1),([b],1),([c],1),([d],1),([e],1)]
But the smallest number of clusters is [([a,b],2),([c,d],2),([e],1)]
How is an algorithm supposed to know which of these (or which clustering in between them) you want? You need find some way to quantify the tradeoff that you are willing to accept between number of clusters and evenness of distribution.
You can create an example with an arbitrarily large discrepancy between the two possibilities by creating any set with 2k + 1 elements, and assigning them all the value N/2. This will lead to the smallest number of clusters being k+1 clusters (k of 2 elements and 1 of 1) with a weight difference of N/2 between the largest and smallest clusters. And then the most even distribution for this set will be 2k + 1 clusters of 1 element each, with no weight difference.
Edit: Also, "evenness" itself is not a well-defined idea. Are you looking to minimize the largest absolute difference in weights between clusters, or the mean difference in weights, or the median difference in weights, or the standard deviation in weights?
Upvotes: 1