Reputation: 199
I am testing OpenMP for C++ because my software will rely heavily on speed gains from processor parallelisation.
I am getting strange results when running the following code.
I am using the g++ compiler, version 7.3.0 and Ubuntu 18.04 OS on an i5-8600 CPU with 16 GB RAM.
Outputs:
Transcript:
.../OpenMPTest$ g++ -O3 -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 2.87415 seconds.
Parallel action took: 0.99954 seconds.
.../OpenMPTest$ g++ -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 25.7037 seconds.
Parallel action took: 68.0485 seconds.
As you can see, with -O3 I'm only getting a ~2.9x speed-up from 6 processors. If I omit the -O flag entirely, the program runs much slower overall and the parallel version even ends up slower than the sequential one, despite all 6 cores running at 100% utilisation (checked with htop).
Why is this? Also, what can I do to achieve the full 6x increase in performance?
Source code:
#include <iostream>
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {
    using namespace std::chrono;

    const int big_number = 1000000000;
    std::array<double, 6> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential
    high_resolution_clock::time_point start_linear = high_resolution_clock::now();
    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }
    }
    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel
    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();
    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                array[i]++;
            }
        }
    }
    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.
    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    return EXIT_SUCCESS;
}
Upvotes: 3
Views: 330
Reputation: 140
It seems your code is affected by false sharing: the six counters sit on the same cache line, so every increment by one thread invalidates that line for the others.
Don't let different threads write to the same cache line; better yet, avoid sharing variables between threads at all. Below, each counter used by the parallel loop is padded out to its own 64-byte cache line:
#include <iostream>
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {
    using namespace std::chrono;

    const int big_number = 1000000000;
    // Pad the counters: 8 doubles = 64 bytes, so each counter touched by
    // the parallel loop (array[i*8]) lives on its own cache line. Elements
    // not listed in the initializer are zero-initialized.
    alignas(64) std::array<double, 6*8> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential
    high_resolution_clock::time_point start_linear = high_resolution_clock::now();
    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }
    }
    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel
    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();
    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                // Stride of 8 doubles keeps each thread on its own cache line.
                array[i*8]++;
            }
        }
    }
    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.
    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    return EXIT_SUCCESS;
}
8 processors used.
Linear action took: 26.9021 seconds.
Parallel action took: 6.41319 seconds.
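As a further illustration (my own minimal sketch, not part of the code above): the padding can be avoided entirely by letting each thread accumulate into a private local variable and store to the shared array only once per outer iteration, so there is no repeated write to any shared cache line.

#include <iostream>
#include <chrono>
#include <array>
#include <omp.h>

int main() {
    using namespace std::chrono;

    const int big_number = 1000000000;
    std::array<double, 6> array = { };

    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

    // Each iteration accumulates into a thread-private local, so the only
    // write to shared memory is the single store at the end of the outer
    // iteration; the threads have nothing to falsely share per step.
    #pragma omp parallel for
    for(int i = 0; i < 6; i++) {
        double local = 0.0;
        for(int j = 0; j < big_number; j++) {
            local++;
        }
        array[i] = local;
    }

    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl;

    return EXIT_SUCCESS;
}

With one shared store per thread, any residual false sharing on the final writes is a one-off cost rather than a per-iteration one.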
For further background, you can read up on false sharing.
Upvotes: 1