Thanh Nguyen Viet

Reputation: 33

Get array of elements from list that sum to value, using Parallel.ForEach (multiple threads)

I'm using the code below to get the subsets of a list that sum to a target value. However, it only uses 1 CPU. My PC has 64 cores, so I want to use 100% of the CPU to speed up the process. Could you please help me update the code to use Parallel.ForEach (multiple threads)?

using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<List<int>> subset_sum(IEnumerable<int> numbers, int target,
    IEnumerable<int> partial = null, int partial_sum = 0)
{
    partial ??= Enumerable.Empty<int>();
    if (partial_sum == target) yield return partial.ToList();
    if (partial_sum >= target) yield break;
    foreach (var (n, i) in numbers.Select((n, i) => (n, i)))
    {
        var remaining = numbers.Skip(i + 1);
        foreach (var subset in subset_sum(
            remaining, target, partial.Append(n), partial_sum + n))
                yield return subset;
    }
}

var numbers = new List<int> { 3, 9, 8, 4, 5, 7, 10 };
var target = 15;
foreach (var subset in subset_sum(numbers, target))
    Console.WriteLine($"[{string.Join(", ", subset)}] = {subset.Sum()}");

Upvotes: 2

Views: 298

Answers (2)

Theodor Zoulias

Reputation: 43846

Parallelizing recursive algorithms is not easy. It's not obvious how to partition the workload when the function that does the work can create more work recursively. In this particular case it is possible to partition the work by splitting the numbers into two parts, and using the combinations of the first part as the partitions. Then, for each partition, search the second part for combinations whose sum equals the target minus the sum of the partition. Below is an implementation of this idea. The PLINQ library is used for the parallelization:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<IList<int>> FindSubsets(IEnumerable<int> source, int target)
{
    const int leftSize = 8; // Results in 256 partitions
    int[] sourceArray = source.ToArray();
    int[] left = sourceArray.Take(leftSize).ToArray();
    int[] right = sourceArray.Skip(left.Length).ToArray();
    return Partitioner
        .Create(Combinations(left), EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .WithMergeOptions(ParallelMergeOptions.NotBuffered)
        .SelectMany(leftPart =>
        {
            int leftPartSum = leftPart.Sum();
            return FindSubsets_Internal(right, target - leftPartSum).
                Select(rightPart => leftPart.Concat(rightPart).ToArray());
        });

    static IEnumerable<int[]> Combinations(IEnumerable<int> remaining,
        IEnumerable<int> partial = null)
    {
        partial ??= Enumerable.Empty<int>();
        yield return partial.ToArray();
        foreach (var (n, i) in remaining.Select((n, i) => (n, i)))
        {
            var innerRemaining = remaining.Skip(i + 1);
            var inner = Combinations(innerRemaining, partial.Append(n));
            foreach (var subset in inner) yield return subset;
        }
    }

    static IEnumerable<int[]> FindSubsets_Internal(IEnumerable<int> remaining,
        int target, IEnumerable<int> partial = null, int partialSum = 0)
    {
        partial ??= Enumerable.Empty<int>();
        if (partialSum == target) yield return partial.ToArray();
        if (partialSum >= target) yield break;
        foreach (var (n, i) in remaining.Select((n, i) => (n, i)))
        {
            var innerRemaining = remaining.Skip(i + 1);
            var inner = FindSubsets_Internal(
                innerRemaining, target, partial.Append(n), partialSum + n);
            foreach (var subset in inner) yield return subset;
        }
    }
}

This implementation is not much faster than the original single-threaded method, because the recursive algorithm is quite memory-hungry and inefficient. Comparing the value of GC.GetTotalAllocatedBytes(true) before and after the execution reveals that the memory allocation is massive: around 700 MB per second on my PC. So the bottleneck is not the CPU but the bandwidth of the memory bus. The CPU cores are mostly waiting for data to be stored in or fetched from the main memory, instead of calculating. At least that's my theory. For comparison, parallelizing the non-recursive algorithm that I've posted in my other answer yields a reasonable performance improvement (around 2.4x on my quad-core PC).
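The allocation measurement mentioned above can be reproduced with a pattern like this. The workload here is a deliberately allocation-heavy stand-in; in the actual experiment it would be draining the FindSubsets query instead:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Measure managed allocations across a workload by sampling the
// process-wide allocation counter before and after.
long before = GC.GetTotalAllocatedBytes(precise: true);
var stopwatch = Stopwatch.StartNew();

long checksum = 0;
for (int i = 0; i < 1_000; i++)
    checksum += Enumerable.Range(0, 64).ToArray().Sum(); // allocates an array per iteration

stopwatch.Stop();
long after = GC.GetTotalAllocatedBytes(precise: true);
double mbPerSecond = (after - before) / 1_000_000.0
    / Math.Max(stopwatch.Elapsed.TotalSeconds, 0.001);
Console.WriteLine($"Allocated {(after - before) / 1_000_000.0:F1} MB " +
    $"({mbPerSecond:F0} MB/sec), checksum {checksum}");
```

Note that GC.GetTotalAllocatedBytes counts allocations from all threads in the process, which is exactly what you want when the workload is parallelized.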

Upvotes: 1

Theodor Zoulias

Reputation: 43846

This post is not an answer on how to parallelize the subset-finding algorithm. It just shows that a faster single-threaded implementation is possible. The implementation below is not recursive. It uses a Stack<int> (the pending stack) as the mechanism for visiting all the combinations of the source sequence. It is about 7-8 times faster than the recursive implementation on my PC.

using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<IList<int>> FindSubsets(IEnumerable<int> source, int target)
{
    int[] sourceArray = source.ToArray();
    if (target == 0) yield return Array.Empty<int>();

    List<int> subset = new(sourceArray.Length); // Indices
    int subsetSum = 0;
    Stack<int> pending = new(sourceArray.Length); // Indices
    pending.Push(0);

    while (pending.TryPop(out var index))
    {
        while (subset.Count > pending.Count)
        {
            subsetSum -= sourceArray[subset[subset.Count - 1]];
            subset.RemoveAt(subset.Count - 1);
        }
        for (; index < sourceArray.Length; index++)
        {
            subset.Add(index);
            subsetSum += sourceArray[index];
            if (subsetSum == target)
                yield return subset.Select(i => sourceArray[i]).ToArray();
            if (index + 1 < sourceArray.Length) pending.Push(index + 1);
        }
    }
}
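For reference, here is how the method behaves on the sample input from the question. The FindSubsets body is repeated verbatim so that the snippet compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var numbers = new List<int> { 3, 9, 8, 4, 5, 7, 10 };
foreach (var subset in FindSubsets(numbers, 15))
    Console.WriteLine($"[{string.Join(", ", subset)}] = {subset.Sum()}");
// Finds the four matching subsets: {8, 7}, {5, 10}, {3, 4, 8} and {3, 5, 7}.

IEnumerable<IList<int>> FindSubsets(IEnumerable<int> source, int target)
{
    int[] sourceArray = source.ToArray();
    if (target == 0) yield return Array.Empty<int>();

    List<int> subset = new(sourceArray.Length); // Indices
    int subsetSum = 0;
    Stack<int> pending = new(sourceArray.Length); // Indices
    pending.Push(0);

    while (pending.TryPop(out var index))
    {
        // Backtrack: shrink the current subset to match the popped state.
        while (subset.Count > pending.Count)
        {
            subsetSum -= sourceArray[subset[subset.Count - 1]];
            subset.RemoveAt(subset.Count - 1);
        }
        for (; index < sourceArray.Length; index++)
        {
            subset.Add(index);
            subsetSum += sourceArray[index];
            if (subsetSum == target)
                yield return subset.Select(i => sourceArray[i]).ToArray();
            if (index + 1 < sourceArray.Length) pending.Push(index + 1);
        }
    }
}
```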

Upvotes: 2
