Reputation: 2334
I'm trying to understand why std::for_each
which runs on single thread is ~3
times faster than __gnu_parallel::for_each
in the example below:
Time =0.478101 milliseconds
vs
Time =0.166421 milliseconds
Here the code i'm using to benchmark:
#include <iostream>
#include <chrono>
#include <parallel/algorithm>
//The struct I'm using for timming
struct TimerAvrg
{
std::vector<double> times;
size_t curr=0,n;
std::chrono::high_resolution_clock::time_point begin,end;
TimerAvrg(int _n=30)
{
n=_n;
times.reserve(n);
}
inline void start()
{
begin= std::chrono::high_resolution_clock::now();
}
inline void stop()
{
end= std::chrono::high_resolution_clock::now();
double duration=double(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count())*1e-6;
if ( times.size()<n)
times.push_back(duration);
else{
times[curr]=duration;
curr++;
if (curr>=times.size()) curr=0;}
}
double getAvrg()
{
double sum=0;
for(auto t:times)
sum+=t;
return sum/double(times.size());
}
};
int main( int argc, char** argv )
{
float sum=0;
for(int alpha = 0; alpha <5000; alpha++)
{
TimerAvrg Fps;
Fps.start();
std::vector<float> v(1000000);
std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
Fps.stop();
sum = sum + Fps.getAvrg()*1000;
}
std::cout << "\rTime =" << sum/5000<< " milliseconds" << std::endl;
return 0;
}
This is my configuration:
gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04)
Intel® Core™ i7-7600U CPU @ 2.80GHz × 4
htop
to check if the program is running in single or multiple threads
g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark
The same code doesn't get compiled with gcc 8.1.0. I got that error message:
/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
using __gnu_cxx::conf_hypergf;
I already checked couple of posts but either they're very old or not the same issue..
My questions are:
Why is it slower in parallel?
I'm using the wrong functions?
In cppreference it is saying that gcc with Standardization of Parallelism TS
is not supported (mentioned with red color in the table) and my code is running in parallel!?
Upvotes: 1
Views: 1061
Reputation: 11317
Your benchmark is faulty, I'm even surprised it takes time to run it.
You wrote: std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
As v
is a local argument of the operator()
with no reads, I would expect it to become removed by your compiler.
As you now have a loop with a body, that loop can be removed as well as there isn't an observable effect.
And similar to that, the vector can be removed as well as you don't have any readers.
So, without any side effects, this could all be removed. If you would use a parallel algorithm, chances are you have some kind of synchronization, which make optimizing this much harder as there might be side effects in another thread? Proving it doesn't is more complex, not to mention the side effects of the thread management which could exist?
To solve this, a lot of benchmarks have trucks in macros to force the compiler to assume side effects. Use them in the lambda so the compiler doesn't remove it.
Upvotes: 1
Reputation: 238341
Your function [](auto v){ v=0;}
is extremely simple.
The function may be replaced it with a single call to memset
or use SIMD instructions for single threaded parallellism. With the knowledge that it overwrites the same state as the vector initially had, the entire loop could be optimised away. It may be easier for the optimiser to replace std::for_each
than a parallel implementation.
Furthermore, assuming the parallel loop uses threads, one must remember that creation and eventual synchronisation (in this case there is no need for synchronisation during processing) have overhead, which may be significant in relation to your trivial operation.
Threaded parallellism is often only worth it for computationally expensive tasks. v=0
is one of the least computationally expensive operations there are.
Upvotes: 6