Reputation: 18219
I just read this intro to parallel processing with OpenMP.
I tried the following simple code
#include <iostream>
#include <ctime>
#include <vector>
int main()
{
    // Create an object just to allow the following loops to do something
    std::vector<int> a;
    a.reserve(2000);

    // First single threaded loop
    std::clock_t begin;
    std::clock_t end;
    begin = std::clock();
    double elapsed_secs;
    for(int n=0; n<1000000000; ++n)
    {
        if (n%100000000 == 0) a.push_back(n);
    }
    end = std::clock();
    elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time for single thread loop: " << elapsed_secs << std::endl;

    // Second multithreaded loop
    begin = std::clock();
    #pragma omp parallel for
    for(int n=0; n<1000000000; ++n)
    {
        if (n%100000000 == 0) a.push_back(n);
    }
    end = std::clock();
    elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time for multi thread loop: " << elapsed_secs << std::endl;
    return 0;
}
which I compiled with g++ -std=c++11 -o a a.cpp -fopenmp and which outputs
Time for single thread loop: 3.9438
Time for multi thread loop: 3.94977
Note that my machine has 12 cores (and no big processes currently running). Why doesn't the multithreaded loop run any faster than the single-threaded one?
Upvotes: 1
Views: 722
Reputation: 4343
You're not measuring real time but CPU time with std::clock. Better use std::chrono, as another answer suggested.
Or for a quick test without changing your code, try this in a shell:
date; time ./a; date
This was the output:
jue dic 1 23:12:57 CET 2016
Time for single thread loop: 2.99741
Time for multi thread loop: 4.55788
real 0m4.184s
user 0m7.556s
sys 0m0.000s
jue dic 1 23:13:01 CET 2016
The times differ from your output: the real (wall-clock) time is about 4 s on my PC, not the roughly 7.5 s that your program reports in total.
You should read the docs about std::clock(), specifically:
For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock.
Upvotes: 1
Reputation: 45424
It should also be said that what you're doing in the parallel loop is nonsensical and bound to cause run-time errors. There are two issues.
First
#pragma omp parallel for
for(int n=0; n<1000000000; ++n)
{ if(n%100000000 == 0) <some code> }
only ever does anything 10 (ten) times. The compiler may optimise the loop variable n away, and you're left with code equivalent to
#pragma omp parallel for
for(int n=0; n<10; ++n)
{ <some code> }
which only benefits from parallelism if <some code> is computationally demanding. So you're actually not testing anything.
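For illustration only (this sketch is mine, not part of the original answer), a loop whose body does real work on every iteration, here a sum of square roots combined with a reduction clause, is the kind of test where the parallel version can actually win:
#include <cmath>
#include <iostream>

int main()
{
    const int N = 100000000;
    double sum = 0.0;
    // Each iteration does a non-trivial computation, so splitting the range
    // across threads can pay off; the reduction clause gives every thread a
    // private partial sum and combines them at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int n = 0; n < N; ++n)
    {
        sum += std::sqrt(static_cast<double>(n));
    }
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
Compiled with -fopenmp, a loop like this should scale roughly with the number of cores, unlike the mostly empty loop above.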
Second, and more seriously,
a.push_back(n);
is not thread safe. That is, you must not call it (potentially) concurrently from different threads. Each call to std::vector::push_back() changes the state of the vector, i.e. its internal data, causing a race condition.
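One minimal way to keep the original push_back while avoiding the race (my sketch, not part of the original answer) is to serialize that one statement with a critical section:
#pragma omp parallel for
for (int n = 0; n < 1000000000; ++n)
{
    if (n % 100000000 == 0)
    {
        // Only one thread at a time may execute this block, so the
        // concurrent calls to push_back no longer race.
        #pragma omp critical
        a.push_back(n);
    }
}
Of course the critical section serializes exactly the interesting work, so this fixes correctness rather than performance; collecting into per-thread vectors and merging afterwards would be the more scalable pattern.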
Finally, I would recommend not using OpenMP for parallelism in C++, because it does not support/exploit C++ language features (such as templates) and is not standardized against recent C++ standards. Instead, use something like TBB, which is designed for C++.
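As a rough illustration of that suggestion (my sketch, not from the original answer; it assumes TBB is installed and you link with -ltbb), the compute-heavy loop from above could be written with tbb::parallel_reduce:
#include <cmath>
#include <functional>
#include <iostream>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

int main()
{
    const int N = 100000000;
    // parallel_reduce splits the range across worker threads; each chunk
    // accumulates into its own local sum, and std::plus combines the partial sums.
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, N), 0.0,
        [](const tbb::blocked_range<int>& r, double local) {
            for (int n = r.begin(); n != r.end(); ++n)
                local += std::sqrt(static_cast<double>(n));
            return local;
        },
        std::plus<double>());
    std::cout << "sum = " << sum << std::endl;
    return 0;
}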
Upvotes: 0
Reputation: 6731
std::clock measures CPU time, not wall time (at least in the gcc implementation, though I believe the MSVC implementation measures wall time). This is an excerpt from cppreference:
Returns the approximate processor time used by the process since the beginning of an implementation-defined era related to the program's execution. To convert result value to seconds divide it by CLOCKS_PER_SEC.
Only the difference between two values returned by different calls to std::clock is meaningful, as the beginning of the std::clock era does not have to coincide with the start of the program.
std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock.
You can measure wall time with the std::chrono facilities:
auto Begin = std::chrono::high_resolution_clock::now();
// ...
auto End = std::chrono::high_resolution_clock::now();
std::cout << "Time for xxx: " << std::chrono::duration_cast<std::chrono::milliseconds>(End - Begin).count() << std::endl;
and you will see the real speedup.
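Applied to your program, timing the parallel loop might look roughly like this (a sketch on the assumption that you also guard push_back, as discussed elsewhere in this thread; steady_clock is the usual choice for measuring intervals):
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> a;
    a.reserve(2000);

    auto begin = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (int n = 0; n < 1000000000; ++n)
    {
        if (n % 100000000 == 0)
        {
            // push_back itself is not thread safe, so serialize it
            #pragma omp critical
            a.push_back(n);
        }
    }
    auto end = std::chrono::steady_clock::now();

    // Wall-clock duration, independent of how many cores the loop used
    std::cout << "Time for multi thread loop: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()
              << " ms" << std::endl;
    return 0;
}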
As a side note, I would say that your test is not thread safe, because push_back needs to modify the end position of your vector.
Upvotes: 1
Reputation: 166
I think the reason your 2nd loop was slower is just the overhead required for threading. The work you're doing is very minimal and tends to be fast on its own. push_back is (amortized) constant time: the function just writes the new item at the current end of the container and then updates the end pointer. I think if you were to put more complicated code into the loop you would start to see a difference. I also noticed you never clear 'a' between loops, so the second loop appends to a vector that already holds the first loop's items, which might cause the 2nd loop (even if it wasn't threaded) to run slightly slower.
I don't think you've misunderstood how to thread so much as you've overlooked the overhead (including thread creation, initialization, context switches, etc.) needed to do the threading.
This question seems to be rather common, but at the same time the answer to each one can be very different, as many things go into threading (https://unix.stackexchange.com/questions/80424/why-using-more-threads-makes-it-slower-than-using-less-threads). In that link the answer is broader, stating that thread speed depends heavily on system resources such as CPU, RAM, and network I/O. This link: Why is my multi-threading slower than my single threading? shows that the OP was writing to the console, which is where the problem was (according to the link, the console class handles the thread synchronization, so the code in that built-in class is what made the single-threaded version run faster).
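To get a feel for that overhead on your own machine (a sketch I'm adding, not part of the original answer), you can time a parallel loop whose body is essentially empty; most of what you measure is then the cost of setting up and synchronizing the thread team:
#include <chrono>
#include <iostream>

int main()
{
    auto begin = std::chrono::steady_clock::now();
    // Trivial body: nearly all of the measured time is threading overhead.
    #pragma omp parallel for
    for (int n = 0; n < 12; ++n)
    {
        volatile int x = n;  // keeps the loop from being optimized away
        (void)x;
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "Parallel-for overhead: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
              << " us" << std::endl;
    return 0;
}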
Upvotes: 0