sds6065
sds6065

Reputation: 205

Algorithm to bucket set of numbers based on buckets sum

I have a set of items that I would like to bucket into N different buckets. Each item has a property associated with it (size) and I would like the sum of this property in each bucket to be roughly equal. What is the best way to determine this? Note the range of the size on the items is fairly large, in the data set I'm using the smallest size is 1 and the largest is 325,220.

Example:

Item A - size 5
Item B - size 10
Item C - size 8
Item D - size 16
Item E - size 7

If I wanted to group these into 3 buckets I would want

Bucket 1: A, B
Bucket 2: C, E
Bucket 3: D

Upvotes: 2

Views: 1164

Answers (2)

sds6065
sds6065

Reputation: 205

I ended up implementing the complete greedy algorithm described in the paper linked by Joe Farrel. The full C# code I used is below:

public class Item
{
    public int Id { get; }
    public int Size { get; }

    public Item(int id, int size)
    {
        Id = id;
        size = size;
    }
}

public class Partition
{
    public int Index { get; }
    public ImmutableList<Item> Items { get; } = ImmutableList<Item>.Empty;
    public int Sum { get; }

    public Partition(int index)
    {
        Index = index;
    }

    private Partition(int index, ImmutableList<Item> items, int sum)
    {
        Index = index;
        Item = items;
        Sum = sum;
    }

    public Partition Add(Item item) => new Partition(Index, Items.Add(item), Sum + item.Size);

    public static double AverageDifference(ImmutableList<Partition> partitions)
    {
        var differences = new List<int>();
        for (var i = 0; i < partitions.Count; i++)
        {
            var partition = partitions[i];
            var otherPartitions = partitions.RemoveAt(i);
            foreach (var otherPartition in otherPartitions)
            {
                differences.Add(Math.Abs(partition.Sum - otherPartition.Sum));
            }
        }

        return differences.Average();
    }
}

public class Node
{
    public Item Item { get; set; }
    public int Partition { get; set; }
    public Node[] Children { get; set; }
}

private (Node tree, int totalSum) InitTree(IEnumerable<Item> items)
{
    var root = new Node();
    var totalSum = 0;
    Node[] previousLevel = {root};
    foreach (var item in items.OrderByDescending(i => i.Size))
    {
        totalSum += item.Size;
        var currentLevel = new Node[_numPartitions];
        for (var i = 0; i < _numPartitions; i++)
        {
            currentLevel[i] = new Node
            {
                Item = item,
                Partition = i
            };
        }

        foreach (var node in previousLevel)
        {
            node.Children = currentLevel;
        }

        previousLevel = currentLevel;
    }

    return (root, totalSum);
}

private ImmutableList<Partition> GetPartitions(Node tree, int totalSum)
{
    var partitions = ImmutableList<Partition>.Empty;
    for (var i = 0; i < _numPartitions; i++)
    {
        partitions = partitions.Add(new Partition(i));
    }

    return TraverseTree(tree, partitions, totalSum, double.MaxValue, ImmutableList<Partition>.Empty);
}

private ImmutableList<Partition> TraverseTree(Node node, ImmutableList<Partition> partitions, int totalSum, double bestDifference, ImmutableList<Partition> bestPartitions)
    {
        var currentPartitions = partitions;
        if (node.Item != null) // skip root
        {
            // place item into its partition
            var updatedPartition = currentPartitions[node.Partition].Add(node.Item);
            currentPartitions = currentPartitions.SetItem(node.Partition, updatedPartition);
        }

        // if this is a leaf, partition is complete
        if (node.Children == null)
        {
            return currentPartitions;
        }

        // terminate path if partition is sufficiently bad
        var largestSum = currentPartitions.Max(p => p.Sum);
        if (largestSum - (totalSum - largestSum) / (_numPartitions - 1) >= bestDifference)
        {
            return null;
        }

        // contintue to traverse tree in ascending partition size order
        foreach (var partition in currentPartitions.OrderBy(p => p.Sum))
        {
            var nextNode = node.Children[partition.Index];
            var nextPartitions = TraverseTree(nextNode, currentPartitions, totalSum, bestDifference, bestPartitions);

            if (nextPartitions == null) // path was terminated
            {
                continue;
            }

            // if we hit a perfect parition set, return it
            var nextDifference = Partition.AverageDifference(nextPartitions);
            if (nextDifference <= 1)
            {
                return nextPartitions;
            }

            // hold on to the best partition
            if (nextDifference < bestDifference)
            {
                bestDifference = nextDifference;
                bestPartitions = nextPartitions;
            }
        }

        return bestPartitions;
    }

    _numPartitions = 4
    var items = GetItems()
    var (tree, totalSum) = InitTree(items);
    var partitions = GetPartitions(tree, totalSum);

Upvotes: 1

Paddy3118
Paddy3118

Reputation: 4772

My answer to this question on laying out pictures might be able to be adapted. Pictures with height become items with size. Columns per page becomes buckets.

The algorithm has three parts: first-fit; greedy-swapping; and reverse sorting, some may be of more use than others with your data.

Upvotes: 0

Related Questions