Reputation: 21
I am implementing a Monte Carlo simulation, where I need to run multiple realisations of some dynamics and then take an average over the end states of all the simulations. Since the number of realisations is large, I run them in parallel using OpenMP. Every realisation starts from the same initial conditions, and at each time step a process happens with a given probability; to decide which process occurs, I draw a random number from a uniform distribution.
I want to make sure that all simulations are statistically independent and that there is no overlap in the random numbers that are being drawn.
I use OpenMP to parallelise the for loops, so the skeleton code looks like this:
vector<int> data(number_of_sims);
#pragma omp parallel for
for (int i = 0; i < number_of_sims; i++) {
    // run sim; t and r are declared inside the loop so each thread has its own copies
    double t = 0;
    while (t < T) {
        double r = draw_random_uniform();
        if (r < p) do_something();
        else do_something_else();
        t += 1.0; // increment time
    }
    // some calculation
    data[i] = calculate();
}
So every time I want a random number, I call a function that uses a Mersenne Twister seeded from std::random_device:
double draw_random_uniform(){
    static thread_local auto seed = std::random_device{}();
    static thread_local std::mt19937 mt(seed);
    std::uniform_real_distribution<double> distribution(0.0, 1.0);
    return distribution(mt);
}
However, since I ultimately want to run this code on a high-performance computing cluster, I want to avoid using std::random_device, as it is risky on systems with little entropy.
So instead I want to create an initial random number generator and then jump it forward a large amount for each of the threads. I have been attempting to do this with the Xoshiro256+ PRNG (I found a good implementation here: https://github.com/Reputeless/Xoshiro-cpp). Something like this, for example:
XoshiroCpp::Xoshiro256Plus prng(42); // properly seeded prng
#pragma omp parallel
{
    static thread_local XoshiroCpp::Xoshiro256Plus lprng(prng); // thread-local copy
    lprng.longJump(); // jump ahead
    // code as before, except use lprng to generate random numbers
    #pragma omp for
    ....
}
However, I cannot get such an implementation to work, I suspect because of the nested OpenMP constructs. I had the thought of pre-generating all of the PRNGs, storing them in a container, and then accessing the relevant one via omp_get_thread_num() inside the parallelised for loop.
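That pre-generated-container idea can be sketched as follows. To keep the sketch self-contained I inline a minimal xoshiro256+ with a `jump()` taken from Vigna's public-domain reference code (the Xoshiro-cpp library linked above already provides `jump()`/`longJump()`); `make_thread_rngs` is a hypothetical helper name, not part of any library:

```cpp
#include <cstdint>
#include <vector>

// Minimal xoshiro256+ following Vigna's reference implementation,
// inlined here only so the sketch needs no external header.
struct Xoshiro256Plus {
    uint64_t s[4];
    explicit Xoshiro256Plus(uint64_t seed) {
        // splitmix64 seeding, as recommended by the xoshiro authors
        for (auto &w : s) {
            uint64_t z = (seed += 0x9e3779b97f4a7c15ULL);
            z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
            z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
            w = z ^ (z >> 31);
        }
    }
    static uint64_t rotl(uint64_t x, int k) { return (x << k) | (x >> (64 - k)); }
    uint64_t next() {
        uint64_t result = s[0] + s[3];
        uint64_t t = s[1] << 17;
        s[2] ^= s[0]; s[3] ^= s[1]; s[1] ^= s[2]; s[0] ^= s[3];
        s[2] ^= t; s[3] = rotl(s[3], 45);
        return result;
    }
    void jump() { // advance the state by 2^128 steps
        static const uint64_t JUMP[] = {0x180ec6d33cfd0aba, 0xd5a61266f0c9392c,
                                        0xa9582618e03fc9aa, 0x39abdc4529b1661c};
        uint64_t t[4] = {0, 0, 0, 0};
        for (uint64_t j : JUMP)
            for (int b = 0; b < 64; b++) {
                if (j & (1ULL << b))
                    for (int w = 0; w < 4; w++) t[w] ^= s[w];
                next();
            }
        for (int w = 0; w < 4; w++) s[w] = t[w];
    }
};

// Hypothetical helper: one generator per thread, each copy of the
// master jumped i times, so streams are 2^128 draws apart.
std::vector<Xoshiro256Plus> make_thread_rngs(uint64_t seed, int n_threads) {
    std::vector<Xoshiro256Plus> rngs;
    Xoshiro256Plus master(seed);
    for (int i = 0; i < n_threads; i++) {
        rngs.push_back(master); // copy current (already jumped) state
        master.jump();
    }
    return rngs;
}
```

Inside the parallel region each thread would then draw from `rngs[omp_get_thread_num()]`; the container is built once, serially, before the `#pragma omp parallel`.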
I am unsure if this is the best way to go about doing all this. Any advice is appreciated.
Upvotes: 2
Views: 400
Reputation: 129
For large-scale parallel Monte Carlo simulations, Collatz-Weyl generators would be a good fit:
https://arxiv.org/abs/2312.17043
This is a two-line PRNG named CWG128 (there is also a smaller 64-bit generator):
static __uint128_t c[4]; // c[0] must be odd

__uint128_t CWG128(void){
    c[1] = (c[1] >> 1) * ((c[2] += c[1]) | 1) ^ (c[3] += c[0]);
    return c[2] >> 96 ^ c[1];
}
There is no risk of overlap because each odd value of c[0] initializes the generator on a separate, independent cycle. For full independence of the streams from the first generated number onward (generators initialized with similar numbers, e.g. 1, 3, 5, ..., may exhibit small correlations in the initial values), one of two initialization methods should be used.
In the first method you can initialize the generators with consecutive odd numbers (specifically c[0]; the remaining variables can be equal to 0), but then you have to skip the first 96 outputs. You can use this function for that purpose:
void CWG128_initializer(void){
    for (int i = 0; i < 96; i++){
        c[1] = (c[1] >> 1) * ((c[2] += c[1]) | 1) ^ (c[3] += c[0]);
    }
}
In the second method you can initialize your generators by:
c[1] = splitmix64();
c[0] = ((__uint128_t)splitmix64() << 64) | ((splitmix63() << 1) | 1);
And here are splitmix64() and splitmix63(); note that both advance the same state variable y, which is defined once:

static uint64_t y; /* The state can be seeded with any value. */

uint64_t splitmix64() {
    uint64_t z = (y += 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

uint64_t splitmix63() {
    uint64_t z = (y += 0x9e3779b97f4a7c15) & 0x7fffffffffffffff;
    z = ((z ^ (z >> 30)) * 0xbf58476d1ce4e5b9) & 0x7fffffffffffffff;
    z = ((z ^ (z >> 27)) * 0x94d049bb133111eb) & 0x7fffffffffffffff;
    return z ^ (z >> 31);
}
You set y to any value, and then use:
c[1] = splitmix64();
c[0] = ((__uint128_t)splitmix64() << 64) | ((splitmix63() << 1) | 1);
to serially initialize subsequent generators. You can initialize 2^63 independent, non-overlapping streams by the splitmix method (the splitmixes cannot generate more numbers), or 2^127 independent, non-overlapping streams by initializing c[0] with 1, 3, 5, ... (and c[1] = 0, c[2] = 0, c[3] = 0) and skipping the first 96 outputs of each stream for full independence from the first number.
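To give each thread its own stream, the bookkeeping can be sketched by moving the four words of state into a struct instead of the global c[]; `cwg128_state` and `cwg128_init_stream` are illustrative names, and `__uint128_t` is a GCC/Clang extension, so this is a sketch for those compilers:

```cpp
#include <cstdint>

// Per-stream state; replaces the global c[4] so each thread owns one stream.
typedef struct { __uint128_t c[4]; } cwg128_state;

// Same update rule as the global CWG128() above, acting on one stream's state.
static __uint128_t cwg128_next(cwg128_state *s) {
    s->c[1] = (s->c[1] >> 1) * ((s->c[2] += s->c[1]) | 1) ^ (s->c[3] += s->c[0]);
    return s->c[2] >> 96 ^ s->c[1];
}

// First initialization method from above: stream k gets c[0] = 2k + 1
// (consecutive odd numbers), the rest zero, and the first 96 outputs
// are discarded for full independence from the first number.
static void cwg128_init_stream(cwg128_state *s, uint64_t k) {
    s->c[0] = ((__uint128_t)k << 1) | 1;
    s->c[1] = s->c[2] = s->c[3] = 0;
    for (int i = 0; i < 96; i++) cwg128_next(s);
}
```

Thread i would then call `cwg128_init_stream(&state, i)` once and draw from its own `state` thereafter, with no coordination between threads.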
Upvotes: 0
Reputation: 2869
You should read "Parallel Random Numbers: As Easy as 1, 2, 3" (http://www.thesalmons.org/john/random123/papers/random123sc11.pdf). This paper explicitly addresses your forward-stepping issues. You can now find implementations of these generators in maths libraries (such as Intel's MKL, which uses the specialized encryption instructions, so it will be hard to beat by hand!).
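The core idea of such counter-based generators can be illustrated in a few lines: the n-th number of stream `key` is a pure function of (key, n), so there is no state to carry, jump, or coordinate. The sketch below uses splitmix64-style bit mixing as a stand-in for the real Philox/Threefry bijections and is for illustration only, not a vetted generator:

```cpp
#include <cstdint>

// Counter-based sketch: output depends only on (key, counter).
// Simulation i uses key = i and draws counter = 0, 1, 2, ...
// The key*constant + counter combiner is deliberately simple (it is not
// injective across all (key, counter) pairs); Random123 uses a keyed
// bijection instead. The mixing steps themselves are bijective on 64 bits.
uint64_t counter_rng(uint64_t key, uint64_t counter) {
    uint64_t z = key * 0x9e3779b97f4a7c15ULL + counter;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}
```

Because the call is stateless, it is trivially parallel and fully reproducible: any thread can regenerate any draw of any simulation from (key, counter) alone.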
Upvotes: 0
Reputation: 2860
Coordinating random number generators with long jumps can be tricky. Alternatively, there is a much simpler method.
Here is a quote from the author's website:

It is however important that the period is long enough. Moreover, if you run n independent computations starting at random seeds, the sequences used by each computation should not overlap. Now, given a generator with period P, the probability that subsequences of length L starting at random points in the state space overlap is bounded by n²L/P. If your generator has period 2^256 and you run on 2^64 cores (you will never have them) a computation using 2^64 pseudorandom numbers (you will never have the time) the probability of overlap would be less than 2^-64.
So instead of trying to coordinate jumps, you could simply seed a new generator in each thread from std::random_device{}. The period is so large that the streams will practically never collide.
While this sounds like a very ad-hoc approach, random seeding is actually a widely used and classic method.
You just need to make sure the seeds are different; most platforms offer a suitable source of random seeds. If repeatability is not needed, seeding from a random source is the easiest and safest solution.
The paper by L'Ecuyer et al. from 2017 gives a good overview of methods for generating parallel streams. It calls this approach "RNG with a 'random' seed for each stream" in chapter 4.
vector<int> data(number_of_sims);
#pragma omp parallel for
for (int i = 0; i < number_of_sims; i++) {
    // random 128-bit seed, drawn independently for every simulation
    auto rd = std::random_device{};
    auto seed = std::seed_seq{rd(), rd(), rd(), rd()};
    auto mt = std::mt19937{seed};
    // run sim; t and r are local, so each thread has its own copies
    double t = 0;
    while (t < T) {
        double r = draw_random_uniform(mt);
        if (r < p) do_something();
        else do_something_else();
        t += 1.0; // increment time
    }
    // some calculation
    data[i] = calculate();
}
and
double draw_random_uniform(std::mt19937 &mt){
    std::uniform_real_distribution<double> distribution(0.0, 1.0);
    return distribution(mt);
}
If number_of_sims is not extremely large, there is no need for static or thread_local initialization.
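If number_of_sims is very large, the cost of constructing a std::random_device and re-seeding in every iteration may become noticeable; a middle ground is to seed once per thread via thread_local, as sketched below (`draw_random_uniform_tl` is an illustrative name):

```cpp
#include <random>

// Variant that seeds once per thread rather than once per simulation:
// the thread_local engine is constructed (and seeded from
// std::random_device) on each thread's first call, then reused.
double draw_random_uniform_tl() {
    thread_local std::mt19937 mt = [] {
        std::random_device rd;
        std::seed_seq seed{rd(), rd(), rd(), rd()};
        return std::mt19937(seed);
    }();
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(mt);
}
```

The same overlap-probability argument quoted above applies: with independently random per-thread seeds and a long period, the streams will practically never collide.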
Upvotes: 2