Reputation: 347
The problem is: I have a set of numbers, and I need to divide it into k subsets. I have to find the best partition strategy to get the minimum sum of the variance from each subsets. No subset can be empty (The variance is the square of standard deviation.)
k is an integer greater than 0. The approximation can be 1e+7
This is my solution so far, which works for some examples, but not always:
Python code:
class MinimumVariancePartition(object):
def minDev(self, mixedSamples, k):
# mixedSamples is a tuple, k is an integer.
samples_ascending = sorted(mixedSamples)
# Build a list of lists contains indices and distances.
idx_dist = []
for index in range(len(samples_ascending) - 1):
starting_idx = index
dist = abs(samples_ascending[index] - samples_ascending[index + 1])
idx_dist.append([starting_idx, dist])
sorted_idx_dist = sorted(idx_dist, key=lambda x: x[1], reverse=True)
# Get a list of indices to split the sample.
split_idx = []
for i in range(k - 1):
split_idx.append(sorted_idx_dist[i][0])
# Get a list of subsets.
partitions = []
if len(split_idx) == 0:
partitions.append(mixedSamples)
else:
split_idx = sorted(split_idx)
prev = 0
for idx in split_idx:
partitions.append(samples_ascending[prev:idx + 1])
prev = idx + 1
partitions.append(samples_ascending[prev:])
# Compute the sum of variances
result = 0
for partition in partitions:
variance = self.variance(partition)
result += variance
return result
def variance(self, partition):
# Compute variance of a subset
size = len(partition)
s = sum(partition)
mean = float(s) / size
variance = 0
for n in partition:
temp = round(n - mean, 14)**2 # use round() to avoid float number 'trick'
variance += temp
variance /= size
return variance
tests passed:
input: (3, 4, 7, 10), 1
output: 7.5
input: (1000,500,1,500), 3
output: 0.0
input: (42,234,10,1,123,545,436,453,74,85,34,999), 5
output: 1700.7397959183672
tests failed:
input: (197, 611, 410, 779, 203, 15, 727, 446, 992, 722, 439, 296, 201, 820, 416, 272, 89, 146, 687, 203, 598, 65, 865, 945, 446, 783, 581, 270, 960, 22, 970, 698, 456, 706, 14, 901, 371, 688, 914, 925, 551, 15, 326, 620, 842, 82, 594, 99, 827, 660), 21
expected output: 757.3225
actual output: 824.586388889
input: (359, 408, 124, 89, 26, 878, 677, 341, 166, 434, 886, 539, 227, 420, 655, 330, 835, 378, 763, 401, 883, 332, 215, 424, 365, 841, 113, 825, 777, 969, 970, 668, 602, 708, 874, 930, 423, 549, 236), 13
expected output: 1588.0486111111109
actual output: 2163.79166667
input: (706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956), 4
expected output: 8172.465
actual output: 11259.875
I am thinking the problem in my solution might be in the finding partition index step, but still don't know why it doesn't work.
Upvotes: 5
Views: 3240
Reputation: 18556
It doesn't work because the idea of your algorithm is not correct(splitting taking into account only the distance between two adjacent elements does not always yield an optimal solution).
You can use dynamic programming instead:
1. Sort the array.
2. Let's assume that f(first_free, sets_count)
is the minimum sum of variances if the first_free
element is the first element that has not been added to any set yet and exactly sets_count
sets have already been created.
3. The base case is f(0, 0) = 0
. It corresponds to an empty prefix.
4. Transitions look this way:
for first_free = 0 ... n - 1:
for new_first_free = first_free + 1 ... n:
for sets_count = 0 ... k - 1:
f(new_first_free, sets_count + 1) = min(f(new_first_free, sets_count + 1),
f(first_free, sets_count) + variance of the subset [first_free, new_first_free - 1])
f(n, k)
(where n
is the number of elements in the set).Here is my implementation(it can be optimized, it's just a sketch, but it works properly):
a = [706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956]
k = 4
mem = dict()
INF = 1e10
def get_variance(partition):
size = len(partition)
s = sum(partition)
mean = float(s) / size
variance = 0
for n in partition:
temp = round(n - mean, 14) ** 2
variance += temp
variance /= size
return variance
def calc(pos, cnt):
if (pos, cnt) in mem.keys():
return mem[(pos, cnt)]
if pos == 0 and cnt >= 0:
return 0.0
if cnt < 0:
return INF
res = INF
for old_pos in range(0, pos):
res = min(res, calc(old_pos, cnt - 1) + get_variance(a[old_pos: pos]))
mem[(pos, cnt)] = res
return res
if __name__ == '__main__':
a.sort()
print(calc(len(a), k))
Upvotes: 5