Nitin Malapally

Reputation: 648

OpenMP-based loop with reduction scales poorly

I have a loop which I'm trying to parallelize efficiently with OpenMP. It accumulates the mass-weighted squared L2 norms of a stream of vectors, which requires a reduction. Here's the loop:

struct vec3
{
    float data[3] = {};
};

float accumulate_eta_sq_t_mass(const vec3* etas, const float* masses, const std::size_t& n)
{
    auto a = 0.0;
    #pragma omp parallel for simd safelen(16) reduction(+:a)
    for (auto ii = std::size_t{}; ii < n; ++ii)
    {
        const auto& eta = etas[ii];
        const auto x = static_cast<double>(eta.data[0]);
        const auto y = static_cast<double>(eta.data[1]);
        const auto z = static_cast<double>(eta.data[2]);
        const auto m = static_cast<double>(masses[ii]);
        
        a += (x * x + y * y + z * z) * m;
    }
    
    return static_cast<float>(a);
}

I wrote a micro-benchmarking program (code provided in the appendix), which shows that the simple loop does not scale very well, as can be seen in the following diagram.

[Figure: measured speed-up vs. thread count]

Fitting Amdahl's law to these results suggests a serial fraction of roughly 15% (a sketch of that calculation follows below). In your experience (or perhaps in theory), is it normal for OpenMP loops with a reduction clause to scale this poorly? I can provide more information if needed. Thanks in advance.
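
For reference, here is a minimal sketch of how I invert Amdahl's law, S(p) = 1 / (s + (1 - s) / p), to estimate the serial fraction s. The speed-up value in main is only an illustrative placeholder consistent with the ~15% figure above, not a measurement taken from the benchmark.

#include <iostream>

// Amdahl's law: S(p) = 1 / (s + (1 - s) / p).
// Solving for the serial fraction s from a measured speed-up S at p threads:
//     s = (p / S - 1) / (p - 1)
double serial_fraction(double speed_up, double thread_count)
{
    return (thread_count / speed_up - 1.0) / (thread_count - 1.0);
}

int main()
{
    // Illustrative example: a speed-up of ~3.4x on 6 threads corresponds to
    // a serial fraction of roughly 15%.
    std::cout << "Estimated serial fraction: " << serial_fraction(3.4, 6.0) << '\n';
    return 0;
}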

Note: I'm using double precision in the loop to reduce large rounding errors in the accumulation. Using single precision did not lead to any improvement in scaling performance.
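
As an aside: if staying in single precision were ever desirable, compensated (Kahan) summation would be one way to bound the rounding error without widening to double. Below is a minimal serial sketch of that idea; it is not what the benchmark uses, it does not combine directly with the OpenMP reduction clause, and -ffast-math (used in the compile command below) would typically optimize the compensation away.

#include <cstddef> // std::size_t

struct vec3
{
    float data[3] = {};
};

// Minimal serial sketch of Kahan (compensated) summation in single precision.
// Shown only to illustrate the idea; not used by the benchmark.
float accumulate_eta_sq_t_mass_kahan(const vec3* etas, const float* masses, std::size_t n)
{
    auto sum = 0.0f;
    auto compensation = 0.0f; // running estimate of the lost low-order bits
    for (auto ii = std::size_t{}; ii < n; ++ii)
    {
        const auto& eta = etas[ii];
        const auto term = (eta.data[0] * eta.data[0] + eta.data[1] * eta.data[1]
                           + eta.data[2] * eta.data[2]) * masses[ii];
        const auto y = term - compensation;
        const auto t = sum + y;
        compensation = (t - sum) - y;
        sum = t;
    }
    return sum;
}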


Additional information

@paleonix This was run on an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz with 6 cores (no SMT), cache sizes of 192 KiB (L1d), 1.5 MiB (L2) and 9 MiB (L3), and AVX2 as the highest supported vector extension.


Appendix

Here's the minimal reproducer, which should compile with g++ -O3 -march=native -ffast-math -std=c++17 -fopenmp ./min_rep.cpp -o min_rep. It is to be called with a sufficiently large problem size, in my case 10000000.

/// \file This file contains a micro-benchmark program for the reduction loop.

#include <string> // std::stoul
#include <cstdlib> // std::abort, std::free
#include <iostream> // std::cerr
#include <cstdint> // std::uint64_t
#include <algorithm> // std::min
#include <limits> // std::numeric_limits
#include <vector> // std::vector

#include <omp.h>

#include <stdlib.h> // posix_memalign
#include <unistd.h> // sysconf

struct vec3
{
    float data[3] = {};
};

float accumulate_eta_sq_t_mass(const vec3* etas, const float* masses, const std::size_t& n)
{
    auto a = 0.0;
    #pragma omp parallel for simd safelen(16) reduction(+:a)
    for (auto ii = std::size_t{}; ii < n; ++ii)
    {
        const auto& eta = etas[ii];
        const auto x = static_cast<double>(eta.data[0]);
        const auto y = static_cast<double>(eta.data[1]);
        const auto z = static_cast<double>(eta.data[2]);
        const auto m = static_cast<double>(masses[ii]);
        
        a += (x * x + y * y + z * z) * m;
    }
    
    return static_cast<float>(a);
}

int main(int argc, char** argv)
{
    // extract the problem size
    if (argc < 2)
    {
        std::cerr << "Please provide the problem size as command line argument." << std::endl;
        return 1;
    }
    const auto n = static_cast<std::size_t>(std::stoul(argv[1]));
    if (n < 1)
    {
        std::cerr << "Zero valued problem size provided. The program will now be aborted." << std::endl;
        return 1;
    }
    if (n * (sizeof(vec3) + sizeof(float)) / (1024 * 1024 * 1024) > 40) // let's assume there's only 64 GiB of RAM
    {
        std::cerr << "Problem size is too large. The program will now be aborted." << std::endl;
        return 1;
    }
    
    // report
    std::cout << "Starting runs with problem size n=" << n << ".\nThread count: " << omp_get_max_threads() << "."
        << std::endl;
    
    // details
    const auto page_size = sysconf(_SC_PAGESIZE);
    const auto page_size_vec3 = page_size / sizeof(vec3);
    const auto page_size_float = page_size / sizeof(float);
    
    // experiment loop
    const auto experiment_count = 50;
    const auto warm_up_count = 10;
    const auto run_count = experiment_count + warm_up_count;
    auto durations = std::vector(experiment_count, std::numeric_limits<std::uint64_t>::min());
    
    vec3* etas = nullptr;
    float* masses = nullptr;
    for (auto run_index = std::size_t{}; run_index < run_count; ++run_index)
    {
        // allocate
        const auto alloc_status0 = posix_memalign(reinterpret_cast<void**>(&etas), page_size, n * sizeof(vec3));
        const auto alloc_status1 = posix_memalign(reinterpret_cast<void**>(&masses), page_size, n * sizeof(float));
        if (alloc_status0 != 0 || alloc_status1 != 0 || etas == nullptr || masses == nullptr)
        {
            std::cerr << "Fatal error, failed to allocate memory." << std::endl;
            std::abort();
        }
        
        // "first touch"
        #pragma omp parallel for schedule(static)
        for (auto ii = std::size_t{}; ii < n; ii += page_size_vec3)
        {
            etas[ii].data[0] = 0.f;
        }
        #pragma omp parallel for schedule(static)
        for (auto ii = std::size_t{}; ii < n; ii += page_size_float)
        {
            masses[ii] = 0.f;
        }
        
        // run experiment
        const auto t1 = omp_get_wtime();
        accumulate_eta_sq_t_mass(etas, masses, n);
        const auto t2 = omp_get_wtime();
        const auto duration_in_us = static_cast<std::int64_t>((t2 - t1) * 1E+6);
        if (duration_in_us <= 0)
        {
            std::cerr << "Fatal error, no time elapsed in the test function." << std::endl;
            std::abort();
        }
        if (run_index + 1 > warm_up_count)
        {
            durations[run_index - warm_up_count] = static_cast<std::uint64_t>(duration_in_us);
        }
        
        // deallocate
        std::free(etas);
        std::free(masses);
        etas = nullptr;
        masses = nullptr;
    }
    
    // statistics
    auto min = std::numeric_limits<std::uint64_t>::max();
    auto max = std::uint64_t{};
    auto mean = std::uint64_t{};
    for (const auto& duration : durations)
    {
        min = std::min(min, duration);
        max = std::max(max, duration);
        mean += duration;
    }
    mean /= experiment_count;
    
    // report duration
    std::cout << "Mean duration:      " << mean << " us\n"
        << "Min. duration:      " << min << " us\n"
        << "Max. duration:      " << max << " us.\n";
    
    // compute effective B/W
    const auto traffic = 4 * n * sizeof(float);
    constexpr auto inv_gigi = 1.0 / static_cast<double>(1024 * 1024 * 1024);
    const auto traffic_in_gib = static_cast<double>(traffic) * inv_gigi;
    std::cout << "Traffic per run:    " << traffic << " B (" << traffic_in_gib << " GiB)\n" 
        << "Mean effective B/W: " << static_cast<double>(traffic_in_gib) / (static_cast<double>(mean) * 1E-6) << " GiB/s\n"
        << "Min. effective B/W: " << static_cast<double>(traffic_in_gib) / (static_cast<double>(max) * 1E-6) << " GiB/s\n"
        << "Max. effective B/W: " << static_cast<double>(traffic_in_gib) / (static_cast<double>(min) * 1E-6) << " GiB/s\n"
        << std::endl;
    
    return 0;
}

Upvotes: 3

Views: 138

Answers (1)

Nitin Malapally

Reputation: 648

I would like to offer my own answer. Firstly, I noticed that the speed-up hits a roof. The function of interest is a streaming function, i.e. every element is touched exactly once, so the caches provide no reuse. This means the DRAM bandwidth ends up being the bottleneck. Using this bandwidth as the roof, we get the following image.

[Figure: measured effective bandwidth vs. thread count, with the DRAM bandwidth as the roof]
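
To put a number on the roof: the loop streams sizeof(vec3) + sizeof(float) = 16 B per element, and the run time can never drop below the traffic divided by the peak DRAM bandwidth. The sketch below works this out; the bandwidth figure is an assumption (dual-channel DDR4-2666, a round figure of ~40 GB/s), so a measured value (e.g. from a STREAM-style benchmark) should be substituted where available.

#include <iostream>

int main()
{
    // Assumed values; adjust to your system and problem size.
    const auto n = 10000000ull;                 // problem size used above
    const auto bytes_per_element = 3 * 4 + 4;   // vec3 (3 floats) + one mass float = 16 B
    const auto traffic = n * bytes_per_element; // bytes streamed from DRAM per run
    const auto peak_bw = 40.0e9;                // assumed peak DRAM bandwidth in B/s
    
    // A streaming kernel can never run faster than traffic / bandwidth.
    const auto lower_bound_s = static_cast<double>(traffic) / peak_bw;
    std::cout << "Traffic per run:      " << traffic * 1e-6 << " MB\n"
              << "Roofline lower bound: " << lower_bound_s * 1e3 << " ms\n";
    return 0;
}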

Another effect we see is the non-linear strong scaling. I think (but am not sure) that what we see here is caused by the powerful vector units and prefetchers: as the core count increases, we saturate the bandwidth very quickly, so any further increase in core count only adds resource contention.

[Figure: strong-scaling speed-up vs. thread count, flattening as the bandwidth saturates]
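
One way to state this hypothesis more precisely: if a single core can already draw B_1 bytes/s from DRAM, the speed-up of a purely bandwidth-bound loop is capped at roughly min(p, B_peak / B_1). The sketch below is only an illustration of that model with assumed bandwidth numbers, not something measured by the benchmark.

#include <algorithm> // std::min
#include <iostream>

// Speed-up model for a purely bandwidth-bound loop: perfect scaling until the
// cores together saturate the DRAM bandwidth, flat afterwards.
double bandwidth_bound_speed_up(int p, double single_core_bw, double peak_bw)
{
    return std::min(static_cast<double>(p), peak_bw / single_core_bw);
}

int main()
{
    // Assumed, illustrative bandwidth values in GB/s.
    const auto single_core_bw = 12.0;
    const auto peak_bw = 40.0;
    for (int p = 1; p <= 6; ++p)
    {
        std::cout << p << " threads -> predicted speed-up "
                  << bandwidth_bound_speed_up(p, single_core_bw, peak_bw) << '\n';
    }
    return 0;
}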

What makes me doubt this theory is that the peak performance is not quite as close to the DRAM bandwidth as I would have liked. But this could be because of the OpenMP reduction...

Upvotes: 5
