Memphisd
Memphisd

Reputation: 317

Bottleneck at random number generation with multiple threads

I was facing performance issues while generating random numbers via multiple threads. This was cause of using the same random engine for all threads. Then I implemented a vector which contains a random engine for each thread (found this solution in another post here on stackoverflow). But I would expect that the number of iterations per second grows linearily with the number of threads I'm executing. But this seems not to be the case.

Here is a minimal example:

#include <random>
#include <omp.h>

const int threads = 4;

int main()
{
    std::uniform_int_distribution<uint64_t> uint_dist;
    std::vector<std::mt19937_64> random_engines;
    std::random_device rd;

    for (int i = 0;i < threads;i++)
        random_engines.push_back(std::mt19937_64((rd())));

    omp_set_num_threads(threads);

    int counter = 0;
    #pragma omp parallel for
    for (int i = 0;i < threads;++i)
    {
        int thread = omp_get_thread_num();
        while (counter < 100)
        {
            if (uint_dist((random_engines[thread])) < (1ULL << 42))
                counter++;
        }
    }
}

While executing this code with one active thread it takes an average execution time of ~4 seconds on my CPU. Setting threads to 4 gives me an average execution time of ~2 seconds, so the number of threads gets a multiplicator of 4, which ends up in a speedup of 2. Do I miss something?

Upvotes: 0

Views: 188

Answers (1)

gnasher729
gnasher729

Reputation: 52538

First, if you have two cores and hyper threading, it looks like four processors to your code, but it's not four times the speed, only a bit better than twice as fast if you are lucky.

Second, if you use all the CPU power that you have, your computer will heat up and then reduce the clock speed.

Third, you may be using a random number with huge state. The state for one may fit into L1 cache, but not the state for four of them. That can give a huge slowdown.

Fourth, you have a variable "counter" that is shared between threads and read at each iteration. That's not going to be fast.

Upvotes: 2

Related Questions