ChesterL

Reputation: 347

A strategy to partition a set into k subsets to get the minimum sum of the subsets' variances

The problem is: I have a set of numbers, and I need to divide it into k subsets. I have to find the partition strategy that gives the minimum sum of the variances of the subsets. No subset can be empty. (The variance is the square of the standard deviation.)

k is an integer greater than 0. The approximation can be 1e+7

This is my solution so far, which works for some examples, but not always:

  1. Sort the sample (a set of numbers) in ascending order.
  2. Compute the distance between each pair of consecutive elements. Build a list of [index, distance] pairs, where the index refers to the left element (i.e. [[idx, dist], [idx, dist], ...]), and sort this list in descending order by distance.
  3. Take the indices of the first k - 1 entries (the largest gaps), sort them, and split the ascending-sorted sample at those positions.

Python code:

class MinimumVariancePartition(object):
    def minDev(self, mixedSamples, k):
        # mixedSamples is a tuple, k is an integer.

        samples_ascending = sorted(mixedSamples)

        # Build a list of [index, distance] pairs for consecutive elements.
        idx_dist = []
        for index in range(len(samples_ascending) - 1):
            starting_idx = index
            dist = abs(samples_ascending[index] - samples_ascending[index + 1])
            idx_dist.append([starting_idx, dist])

        sorted_idx_dist = sorted(idx_dist, key=lambda x: x[1], reverse=True)

        # Get a list of indices to split the sample.
        split_idx = []
        for i in range(k - 1):
            split_idx.append(sorted_idx_dist[i][0])

        # Get a list of subsets.    
        partitions = []
        if len(split_idx) == 0:
            partitions.append(mixedSamples)
        else:
            split_idx = sorted(split_idx)
            prev = 0
            for idx in split_idx:
                partitions.append(samples_ascending[prev:idx + 1])
                prev = idx + 1
            partitions.append(samples_ascending[prev:])

        # Compute the sum of variances
        result = 0
        for partition in partitions:
            variance = self.variance(partition)
            result += variance
        return result

    def variance(self, partition):
        # Compute variance of a subset
        size = len(partition)
        s = sum(partition)
        mean = float(s) / size
        variance = 0
        for n in partition:
            temp = round(n - mean, 14)**2  # round to suppress floating-point noise before squaring
            variance += temp
        variance /= size
        return variance

tests passed:

input: (3, 4, 7, 10), 1
output: 7.5

input: (1000,500,1,500), 3
output: 0.0

input: (42,234,10,1,123,545,436,453,74,85,34,999), 5
output: 1700.7397959183672

tests failed:

input: (197, 611, 410, 779, 203, 15, 727, 446, 992, 722, 439, 296, 201, 820, 416, 272, 89, 146, 687, 203, 598, 65, 865, 945, 446, 783, 581, 270, 960, 22, 970, 698, 456, 706, 14, 901, 371, 688, 914, 925, 551, 15, 326, 620, 842, 82, 594, 99, 827, 660), 21
expected output: 757.3225
actual output: 824.586388889

input: (359, 408, 124, 89, 26, 878, 677, 341, 166, 434, 886, 539, 227, 420, 655, 330, 835, 378, 763, 401, 883, 332, 215, 424, 365, 841, 113, 825, 777, 969, 970, 668, 602, 708, 874, 930, 423, 549, 236), 13
expected output: 1588.0486111111109
actual output: 2163.79166667

input: (706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956), 4
expected output: 8172.465
actual output: 11259.875

I think the problem in my solution might be in the step that finds the partition indices, but I still don't understand why it doesn't work.
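
For small inputs I can cross-check the expected values with a brute force that tries every assignment of elements to k groups and keeps the best one with no empty group. This is only a sanity-check sketch (the helper name brute_force_min_var is mine, and it is only feasible for tiny inputs); it uses statistics.pvariance for the population variance:

from itertools import product
from statistics import pvariance

def brute_force_min_var(samples, k):
    # Try every assignment of elements to one of k groups and keep the
    # assignment with no empty group that minimizes the sum of variances.
    best = float('inf')
    for labels in product(range(k), repeat=len(samples)):
        groups = [[] for _ in range(k)]
        for value, label in zip(samples, labels):
            groups[label].append(value)
        if any(not g for g in groups):
            continue  # the problem forbids empty subsets
        best = min(best, sum(pvariance(g) for g in groups))
    return best

print(brute_force_min_var((3, 4, 7, 10), 1))        # 7.5
print(brute_force_min_var((1000, 500, 1, 500), 3))  # 0.0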

Upvotes: 5

Views: 3240

Answers (1)

kraskevich

Reputation: 18556

It doesn't work because the idea behind your algorithm is not correct: splitting based only on the distances between adjacent elements does not always yield an optimal solution.

You can use dynamic programming instead:
1. Sort the array.
2. Let f(first_free, sets_count) be the minimum sum of variances given that the element at index first_free is the first one not yet added to any set and exactly sets_count sets have already been created.
3. The base case is f(0, 0) = 0. It corresponds to an empty prefix.
4. Transitions look like this:

for first_free = 0 ... n - 1:
    for new_first_free = first_free + 1 ... n:
        for sets_count = 0 ... k - 1:
            f(new_first_free, sets_count + 1) = min(f(new_first_free, sets_count + 1),
                f(first_free, sets_count) + variance of the subset [first_free, new_first_free - 1])
5. The answer is f(n, k), where n is the number of elements in the set.

Here is my implementation (it can be optimized, it's just a sketch, but it works properly):

a = [706, 835, 160, 432, 148, 472, 26, 917, 736, 342, 442, 479, 95, 800, 956]
k = 4
mem = dict()
INF = 1e10


def get_variance(partition):
    size = len(partition)
    s = sum(partition)
    mean = float(s) / size
    variance = 0
    for n in partition:
        temp = round(n - mean, 14) ** 2
        variance += temp
    variance /= size
    return variance


def calc(pos, cnt):
    # Minimum sum of variances for the first `pos` sorted elements
    # when at most `cnt` more groups may still be formed.
    if (pos, cnt) in mem:
        return mem[(pos, cnt)]
    if pos == 0 and cnt >= 0:
        return 0.0
    if cnt < 0:
        return INF
    res = INF
    # Try every possible start of the last group, i.e. the slice a[old_pos:pos].
    for old_pos in range(0, pos):
        res = min(res, calc(old_pos, cnt - 1) + get_variance(a[old_pos: pos]))
    mem[(pos, cnt)] = res
    return res


if __name__ == '__main__':
    a.sort()
    print(calc(len(a), k))
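
Since the recursion recomputes each slice's variance from scratch, one possible optimization is to precompute prefix sums of the values and of their squares, so the population variance of any contiguous slice is available in O(1), and to fill the table bottom-up. Here is a rough sketch of that idea (the names min_variance_sum and slice_variance are just placeholders; unlike calc, it forms exactly k non-empty groups, as the problem statement requires):

def min_variance_sum(values, k):
    # Bottom-up DP over the sorted values; O(k * n^2) overall.
    a = sorted(values)
    n = len(a)

    # prefix[i] = sum of a[:i], prefix_sq[i] = sum of squares of a[:i]
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(a):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def slice_variance(lo, hi):
        # Population variance of a[lo:hi] via E[x^2] - E[x]^2.
        size = hi - lo
        mean = (prefix[hi] - prefix[lo]) / size
        return max(0.0, (prefix_sq[hi] - prefix_sq[lo]) / size - mean * mean)

    INF = float('inf')
    # dp[cnt][pos] = minimum sum of variances when the first pos elements
    # are split into exactly cnt non-empty contiguous groups.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for cnt in range(1, k + 1):
        for pos in range(cnt, n + 1):
            for prev in range(cnt - 1, pos):
                if dp[cnt - 1][prev] < INF:
                    dp[cnt][pos] = min(dp[cnt][pos],
                                       dp[cnt - 1][prev] + slice_variance(prev, pos))
    return dp[k][n]


print(min_variance_sum(a, k))  # expected to match calc(len(a), k) on the sample above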

Upvotes: 5
