Reputation: 3181
I have an ordered 1-D array of numbers. Both the array length and the values of the numbers in the array are arbitrary. I want to partition the array into k partitions, according to the number values, e.g. let's say I want 4 partitions, distributed as 30% / 30% / 20% / 20%, i.e. the top 30% values first, the next 30% afterwards, etc. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it should not be contained in two different partitions. This means that the distribution percentages above are not strict, but rather the "goals" or "starting points" if you wish.
For example, let's say my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]. I choose k = 4, and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.
Given the constraints I gave above, the resulting partitions should be:
A = [1]
B = [5, 5]
C = [6, 7]
D = [8, 8, 8, 8, 8]
with resulting (achieved/corrected) percentages pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%
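The achieved percentages can be double-checked with a few lines of Python (a quick sketch of the arithmetic, not tied to any particular solution):

```python
ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
parts = [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]  # A, B, C, D

# Achieved percentage of each partition relative to the full array.
achieved = [100 * len(p) / len(ar) for p in parts]
print(achieved)  # -> [10.0, 20.0, 20.0, 50.0]
```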
It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.
So, is there an algorithm for this kind of clustering?
Upvotes: 2
Views: 1030
Reputation:
Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.
def partitions(a, i, sizes, cache):
    """Find a least-cost partition of a[i:].

    The ideal sizes of the partitions are stored in the tuple 'sizes'
    and cache is used to memoize previously calculated results.
    """
    key = (i, sizes)
    if key in cache:
        return cache[key]
    if len(sizes) == 1:
        # Only one partition left: it must take all remaining elements.
        segment = len(a) - i
        result = (segment - sizes[0]) ** 2, (segment,)
        cache[key] = result
        return result
    best_cost, best_partition = None, None
    for j in range(len(a) - i + 1):
        if 0 < j < len(a) - i and a[i + j - 1] == a[i + j]:
            # Avoid breaking a run of one number.
            continue
        bc, bp = partitions(a, i + j, sizes[1:], cache)
        c = (j - sizes[0]) ** 2 + bc
        if best_cost is None or c < best_cost:
            best_cost = c
            best_partition = (j,) + bp
    cache[key] = (best_cost, best_partition)
    return cache[key]

ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (len(ar) * 0.25,) * 4  # ideal size 2.5 for each of the 4 parts
print(partitions(ar, 0, sizes, {}))  # -> (9.0, (1, 2, 2, 5))
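The function returns the partition sizes rather than the partitions themselves; a short follow-up (my addition, assuming the (1, 2, 2, 5) result described above) turns the sizes into the actual slices:

```python
from itertools import accumulate

ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (1, 2, 2, 5)  # as returned by partitions() above

# Cumulative boundaries are [1, 3, 5, 10]; slice between consecutive ones.
bounds = list(accumulate(sizes))
parts = [ar[lo:hi] for lo, hi in zip([0] + bounds[:-1], bounds)]
print(parts)  # -> [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]
```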
Upvotes: 1
Reputation: 1062
Clustering algorithms are used on multi-dimensional data. For one-dimensional data, you should simply use a sorting algorithm.
Sort the data. Then partition the data-set linearly working from the bottom of the array to the top, as per your example.
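Sketching that two-step recipe in Python (my own minimal version, assuming the percentages are given as fractions summing to 1):

```python
def linear_partition(a, fractions):
    """Sort a, then cut it at the cumulative fraction boundaries.

    Note: this ignores the duplicate constraint, so equal values
    may end up in different partitions.
    """
    a = sorted(a)
    n = len(a)
    parts, prev, total = [], 0, 0.0
    for f in fractions:
        total += f
        nxt = round(total * n)  # nearest index to the ideal boundary
        parts.append(a[prev:nxt])
        prev = nxt
    return parts

print(linear_partition([1, 5, 5, 6, 7, 8, 8, 8, 8, 8], [0.25] * 4))
```

On the question's example this splits the runs of 5s and 8s across partitions, which is exactly what the extra constraint forbids, so this approach alone is not enough.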
Upvotes: 1
Reputation: 1518
The naive approach would go like this:
Say p1...pk are the percentages for your partitions (p1+...+pk = 1)
Say you have N elements in the array
The initial boundaries (there are k+1 of them, including the array ends, since you have k partitions) are: 0, p1*N, (p1+p2)*N, ..., N (there'll be some rounding to do).
For moving the boundaries, you look at the two array elements on either side of a boundary (for the k-1 boundaries that you can move). If the two elements are equal, you need to move the boundary, either left or right, at least until the constraint is satisfied. A naive approach would be to start on the left and make minimal adjustments (move each boundary to the side that requires the least movement, and don't move it any further).
This algorithm doesn't cover the whole space of partitions though. It just gives you one solution. To find the best solution, you'd need to do a brute-force search on the entire partition space, with some kind of pruning (e.g. dynamic programming, where you remember the best partitioning for a subarray of the initial array).
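A minimal sketch of the boundary-adjustment idea (the names are mine; this implements only the "move each boundary the least distance" heuristic, not the full search):

```python
def adjust_boundaries(a, fractions):
    """Place boundaries at the cumulative fractions, then nudge each
    one off any run of equal values, picking the cheaper direction."""
    n = len(a)
    cuts, total = [], 0.0
    for f in fractions[:-1]:
        total += f
        cuts.append(round(total * n))  # initial boundary positions
    fixed = []
    for c in cuts:
        left, right = c, c
        # Slide left/right until the boundary no longer splits a run.
        while 0 < left < n and a[left - 1] == a[left]:
            left -= 1
        while 0 < right < n and a[right - 1] == a[right]:
            right += 1
        c = left if c - left <= right - c else right
        fixed.append(c)
    bounds = [0] + fixed + [n]
    return [a[lo:hi] for lo, hi in zip(bounds, bounds[1:])]

ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
print(adjust_boundaries(ar, [0.25] * 4))
```

On the example array this yields [[1], [5, 5, 6, 7], [8, 8, 8, 8, 8], []]: the last boundary collapses to the end of the array, leaving an empty partition, which illustrates why this greedy pass gives you one feasible solution but not necessarily a good one.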
Upvotes: 0