Basti
Basti

Reputation: 2438

Why is this C++11 code containing rand() slower with multiple threads than with one?

I'm trying around on the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers.

#include <iostream>
#include <thread>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <cmath>

double add_single(int N) {
    double sum=0;
    for (int i = 0; i < N; ++i){
        sum+= sqrt(1.0*rand()/RAND_MAX);
    }
    return sum/N;
}

void add_multi(int N, double& result) {
    double sum=0;
    for (int i = 0; i < N; ++i){
        sum+= sqrt(1.0*rand()/RAND_MAX);
    }
    result = sum/N;
}

int main() {
    srand (time(NULL));
    int N = 1000000;

    // single-threaded
    auto t1 = std::chrono::high_resolution_clock::now();
    double result1 = add_single(N);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
    std::cout << "time single: " << time_elapsed << std::endl;

    // multi-threaded
    std::vector<std::thread> th;
    int nr_threads = 3;
    double partual_results[] = {0,0,0};
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < nr_threads; ++i) 
        th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) ));
    for(auto &a : th)
        a.join();
    double result_multicore = 0;
    for(double result:partual_results)
        result_multicore += result;
    result_multicore /= nr_threads;
    t2 = std::chrono::high_resolution_clock::now();
    time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
    std::cout << "time multi: " << time_elapsed << std::endl;

    return 0;
}

Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is

time single: 33
time multi: 565

So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas.

edit:

  1. This problem scales for larger N, so the problem is not the short runtime
  2. The time for creating the threads is not the problem. Excluding it does not change the result significantly

Wow I found the problem. It was indeed rand(). I replaced it with an C++11 equivalent and now the runtime scales perfectly. Thanks everyone!

Upvotes: 44

Views: 5063

Answers (4)

Yakk - Adam Nevraumont
Yakk - Adam Nevraumont

Reputation: 275310

To make this faster, use a thread pool pattern.

This will let you enqueue tasks in other threads without the overhead of creating a std::thread each time you want to use more than one thread.

Don't count the overhead of setting up the queue in your performance metrics, just the time to enqueue and extract the results.

Create a set of threads and a queue of tasks (a structure containing a std::function<void()>) to feed them. The threads wait on the queue for new tasks to do, do them, then wait on new tasks.

The tasks are responsible for communicating their "done-ness" back to the calling context, such as via a std::future<>. The code that lets you enqueue functions into the task queue might do this wrapping for you, ie this signature:

template<typename R=void>
std::future<R> enqueue( std::function<R()> f ) {
  std::packaged_task<R()> task(f);
  std::future<R> retval = task.get_future();
  this->add_to_queue( std::move( task ) ); // if we had move semantics, could be easier
  return retval;
}

which turns a naked std::function returning R into a nullary packaged_task, then adds that to the tasks queue. Note that the tasks queue needs be move-aware, because packaged_task is move-only.

Note 1: I am not all that familiar with std::future, so the above could be in error.

Note 2: If tasks put into the above described queue are dependent on each other for intermediate results, the queue could deadlock, because no provision to "reclaim" threads that are blocked and execute new code is described. However, "naked computation" non-blocking tasks should work fine with the above model.

Upvotes: 3

Nate Kohl
Nate Kohl

Reputation: 35914

As you discovered, rand is the culprit here.

For those who are curious, it's possible that this behavior comes from your implementation of rand using a mutex for thread safety.

For example, eglibc defines rand in terms of __random, which is defined as:

long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

  return retval;
}

This kind of locking would force multiple threads to run serially, resulting in lower performance.

Upvotes: 22

Croniak
Croniak

Reputation: 694

On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected.

void add_multi(int N, double& result) {
double sum=0;
unsigned int seed = time(NULL);
for (int i = 0; i < N; ++i){
    sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX);
}
result = sum/N;
}

Upvotes: 27

Claudio
Claudio

Reputation: 10947

The time needed to execute the program is very small (33msec). This means that the overhead to create and handle several threads may be more than the real benefit. Try using programs that need longer times for the execution (e.g., 10 sec).

Upvotes: 8

Related Questions