pezpezpez

Reputation: 694

Calculating the sum of a large vector in parallel

Problem background

I have a program that currently takes way too long to sum up large std::vectors of ~100 million elements using std::accumulate, and this is a bottleneck.

I want it to be faster, and I want the calculation to run asynchronously so the GUI/server doesn't block. The calculation should also use multithreading to reduce the time it takes to sum up a vector.

I want to split up the summation so that each thread sums a part of the vector and then when all partial sums are calculated, each thread's partial sum should be added together to get the total summation.

Boost.Asio?

I was wondering how I could go about this with Boost.Asio? My program ideally needs to reuse the threads (like a thread group), and I'm not sure how to store and retrieve the partial sums and finally retrieve the sum of the partial sums.

I was thinking of creating a thread group which calls boost::asio::io_service::run, passing a handler to compute the partial sums, but I'm not sure how to pass the partial sums to another handler and add all the partial sums together.

It would be great if someone showed some skeleton code of how I can go about this.

Upvotes: 5

Views: 6867

Answers (2)

Felix Glas

Reputation: 15524

Is Boost.Asio suitable for this problem?

The main purpose of Boost.Asio is to provide an asynchronous model for network and I/O programming, and the problem you describe does not seem to have much to do with networking and I/O.

I think that the simplest solution is to use the threading primitives provided by either Boost or the C++ standard library.

A parallel algorithm

Here's an example of a parallel version of accumulate written using only the standard library.

#include <algorithm>
#include <future>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

/* Minimum number of elements for the multithreaded algorithm.
   Below this, the algorithm is executed on a single thread. */
static const int MT_MIN_SIZE = 10000;

template <typename InputIt, typename T>
auto parallel_accumulate(InputIt first, InputIt last, T init) {
    // Determine total size.
    const auto size = std::distance(first, last);
    // Determine how many parts the work shall be split into.
    // Note: hardware_concurrency() may return 0, hence the max().
    const auto parts = (size < MT_MIN_SIZE) ? 1 :
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::future<T>> futures;

    // For each part, calculate its size and run accumulate on a separate thread.
    for (std::size_t i = 0; i != parts; ++i) {
        const auto part_size = (size * i + size) / parts - (size * i) / parts;
        futures.emplace_back(std::async(std::launch::async,
            [=] { return std::accumulate(first, std::next(first, part_size), T{}); }));
        std::advance(first, part_size);
    }

    // Wait for all threads to finish execution and accumulate the results.
    return std::accumulate(std::begin(futures), std::end(futures), init,
        [] (const T prev, auto& future) { return prev + future.get(); });
}
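As a quick usage sketch (a hypothetical main, not part of the original answer; the long long init value is what keeps this particular sum from overflowing int):

#include <iostream>

int main() {
    std::vector<int> v(100000000);
    std::iota(std::begin(v), std::end(v), 1);  // 1, 2, ..., 100000000

    // T is deduced as long long from the init argument, so the partial
    // sums are accumulated in long long as well.
    std::cout << parallel_accumulate(std::begin(v), std::end(v), 0LL) << '\n';
    // prints 5000000050000000
}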

Live example (the parallel version performs about the same as the sequential one on Coliru, where probably only one core is available)

Timings

On my machine (using 8 threads) the parallel version gave, on average, a ~120 % boost in performance.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
5000000050000000

However, the absolute gain for 100,000,000 elements is only marginal (25 ms), although the performance gain might be greater when accumulating an element type more expensive than int.

OpenMP

As @sehe mentioned in the comments, OpenMP might provide a simple solution to this problem, e.g.

// Requires compiling with OpenMP enabled (e.g. -fopenmp on GCC/Clang);
// an unsigned loop index needs OpenMP 3.0 or later.
template <typename T, typename U>
auto omp_accumulate(const std::vector<T>& v, U init) {
    U sum = init;

    // Each thread accumulates into a private copy of `sum`;
    // the reduction clause combines the copies with `+` at the end.
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < v.size(); i++) {
        sum += v[i];
    }

    return sum;
}
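A minimal usage sketch (hypothetical; remember the OpenMP flag, e.g. g++ -std=c++14 -fopenmp sum.cpp):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(100000000, 1);  // 100 million ones

    // U is deduced as long long from the init argument.
    std::cout << "Sum: " << omp_accumulate(v, 0LL) << '\n';  // Sum: 100000000
}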

On my machine this method performed the same as the parallel method using standard thread primitives.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
Sum: 5000000050000000
--------------------------------
OpenMP sum:
Time taken: 21 ms
Sum: 5000000050000000

Upvotes: 4

sehe

Reputation: 393134

You can use Boost Asio as a thread pool. But there's not a lot of sense in it unless you have... asynchronous IO operations to coordinate.

In this answer to "c++ work queues with blocking" I show two thread_pool implementations:

  • Solution #1: one based on boost::asio::io_service
  • Solution #2: the other based on boost::thread primitives

Both accept any void()-signature-compatible task. This means you could wrap your function-that-returns-the-important-results in a packaged_task<...> and get the future<RetVal> from it.
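For illustration only, a minimal sketch of that pattern (this is not one of the thread_pool implementations from the linked answer, just the packaged_task-plus-io_service idea; io_service and io_service::work are the classic, pre-Boost-1.66 names):

#include <boost/asio.hpp>
#include <algorithm>
#include <future>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    boost::asio::io_service io;
    boost::asio::io_service::work work(io);  // keeps run() from returning while idle

    // Reusable thread group: every thread services the same io_service.
    std::vector<std::thread> pool;
    const auto n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([&io] { io.run(); });

    std::vector<int> v(100000000, 1);

    // Wrap the work in a packaged_task to obtain a future for the result.
    std::packaged_task<long long()> task([&v] {
        return std::accumulate(std::begin(v), std::end(v), 0LL);
    });
    auto result = task.get_future();
    io.post([&task] { task(); });  // executes on one of the pool threads

    std::cout << "Sum: " << result.get() << '\n';  // Sum: 100000000

    io.stop();
    for (auto& t : pool) t.join();
}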

Upvotes: 1
