Ja_cpp

Reputation: 2334

Why is std::for_each faster than __gnu_parallel::for_each?

I'm trying to understand why std::for_each, which runs on a single thread, is ~3 times faster than __gnu_parallel::for_each in the example below:

Time =0.478101 milliseconds (__gnu_parallel::for_each)

vs

Time =0.166421 milliseconds (std::for_each)

Here is the code I'm using to benchmark:

#include <iostream>
#include <chrono>
#include <parallel/algorithm>

// The struct I'm using for timing
struct TimerAvrg
{
    std::vector<double> times;
    size_t curr=0,n;
    std::chrono::high_resolution_clock::time_point begin,end;
    TimerAvrg(int _n=30)
    {
        n=_n;
        times.reserve(n);
    }

    inline void start()
    {
        begin= std::chrono::high_resolution_clock::now();
    }

    inline void stop()
    {
        end= std::chrono::high_resolution_clock::now();
        double duration=double(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count())*1e-6;
        if ( times.size()<n)
            times.push_back(duration);
        else {
            times[curr]=duration;
            curr++;
            if (curr>=times.size()) curr=0;
        }
    }

    double getAvrg()
    {
        double sum=0;
        for(auto t:times)
            sum+=t;
        return sum/double(times.size());
    }
};



int main( int argc, char** argv )
{
    float sum=0;
    for(int alpha = 0; alpha <5000; alpha++)
    {
        TimerAvrg Fps;
        Fps.start();
        std::vector<float> v(1000000);
        std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
        Fps.stop();
        sum = sum + Fps.getAvrg()*1000;
    }

    std::cout << "\rTime =" << sum/5000<< " milliseconds" << std::endl;
    return 0;
}
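
The posted code only shows the std::for_each run; presumably the parallel timing comes from swapping that one call for __gnu_parallel::for_each. A minimal sketch of that assumed parallel variant (not code from the original post, and with the timing left out for brevity) would be:

#include <parallel/algorithm>
#include <vector>

int main()
{
    std::vector<float> v(1000000);
    // Assumed parallel counterpart of the std::for_each call above; needs -fopenmp.
    __gnu_parallel::for_each(v.begin(), v.end(), [](auto v){ v = 0; });
    return 0;
}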

This is my configuration:

gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04) 

Intel® Core™ i7-7600U CPU @ 2.80GHz × 4

htop to check whether the program is running on a single thread or on multiple threads

g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark 

The same code doesn't compile with gcc 8.1.0. I get this error message:

/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
   using __gnu_cxx::conf_hypergf;

I already checked a couple of posts, but they're either very old or not about the same issue.

My questions are:

Why is it slower in parallel?

Am I using the wrong functions?

cppreference says that GCC does not support the Parallelism TS (it's marked in red in the table), yet my code is running in parallel!?

Upvotes: 1

Views: 1061

Answers (2)

JVApen

Reputation: 11317

Your benchmark is faulty; I'm even surprised it takes any time to run at all.

You wrote: std::for_each(v.begin(), v.end(),[](auto v){ v=0;});

Since v is a by-value parameter of the lambda's operator() that is never read afterwards, I would expect the compiler to remove the assignment. As you now have a loop with an empty body, that loop can be removed as well, since there is no observable effect. Similarly, the vector itself can be removed, as nothing ever reads it.

So, with no side effects at all, everything can be removed. If you use a parallel algorithm, chances are there is some kind of synchronization involved, which makes optimizing this much harder: there might be side effects in another thread, and proving there aren't is more complex, not to mention the possible side effects of the thread management itself.

To solve this, a lot of benchmark frameworks have tricks in macros to force the compiler to assume side effects. Use them in the lambda so the compiler doesn't remove the work.
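
A minimal sketch of such a trick, modelled on the DoNotOptimize helper used by Google Benchmark (the helper name below is illustrative, and it relies on GCC/Clang inline assembly):

#include <algorithm>
#include <vector>

// Tell the compiler the value is read and written somewhere it cannot see,
// so the store cannot be eliminated as dead code (GCC/Clang inline asm).
template <class T>
inline void do_not_optimize(T& value)
{
    asm volatile("" : "+r,m"(value) : : "memory");
}

int main()
{
    std::vector<float> v(1000000);
    std::for_each(v.begin(), v.end(), [](auto& x){
        x = 0;               // take x by reference so the write is observable
        do_not_optimize(x);  // and force the compiler to keep it
    });
    return 0;
}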

Upvotes: 1

eerorika

Reputation: 238341

Your function [](auto v){ v=0;} is extremely simple.

The call may be replaced with a single call to memset, or SIMD instructions may be used for single-threaded parallelism. With the knowledge that it overwrites the same state the vector initially had, the entire loop could be optimised away. It is likely easier for the optimiser to do this for std::for_each than for a parallel implementation.
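
As a rough sketch of what the optimiser is allowed to do here (not necessarily what it actually does), the corrected single-threaded loop is equivalent to a fill, which GCC will typically lower to a memset:

#include <algorithm>
#include <cstring>
#include <vector>

int main()
{
    std::vector<float> v(1000000);

    // Equivalent ways of zeroing the data: the optimiser may turn the loop
    // into the fill, and the fill into a single memset (0.0f is all-zero
    // bits for IEEE-754 floats).
    std::for_each(v.begin(), v.end(), [](float& x){ x = 0.0f; });
    std::fill(v.begin(), v.end(), 0.0f);
    std::memset(v.data(), 0, v.size() * sizeof(float));
    return 0;
}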

Furthermore, assuming the parallel loop uses threads, one must remember that their creation and eventual synchronisation (in this case there is no need for synchronisation during processing) have overhead, which may be significant in relation to your trivial operation.

Threaded parallelism is often only worth it for computationally expensive tasks. v=0 is one of the least computationally expensive operations there are.
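
As an illustration of that point (a sketch under the assumption that __gnu_parallel::for_each distributes the range across OpenMP threads), a heavier per-element task is the kind of workload where the thread overhead can amortise:

#include <cmath>
#include <parallel/algorithm>
#include <vector>

int main()
{
    std::vector<double> v(1000000, 1.5);

    // Trivial work: likely reduced to memset/SIMD, so threads only add overhead.
    __gnu_parallel::for_each(v.begin(), v.end(), [](double& x){ x = 0; });

    // Expensive work per element: here spawning threads can actually pay off.
    // Compile with -fopenmp.
    __gnu_parallel::for_each(v.begin(), v.end(), [](double& x){
        for (int i = 0; i < 100; ++i)
            x = std::sin(x) + std::cos(x);
    });
    return 0;
}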

Upvotes: 6
