duttasankha
duttasankha

Reputation: 737

generating random numbers within cuda kernel

I am writing a cuda program where I need to generate a random variable which would be generated by following a normal distribution. I want the value of the random variable to be limited between 0 to 8. So I want the random variable to get generated within the kernel function and then the random variable result would be used for further use. I am planning to use the cuRAND library for the purpose. I have been trying to use curand_normal device api to generate the values but without any success. It would be extremely helpful if someone could supply me with the kernel function code. Thank you for all your assistance.

The code provided below is the cpu implementation of what I am searching for in the gpu :

  #include "stdafx.h"
    #include <iostream>
    #include <random>

    using namespace std;
    int _tmain(int argc, _TCHAR* argv[])
    {
        const int nrolls=10000;  // number of experiments
        const int nstars=100;    // maximum number of stars to distribute
        int i;
        default_random_engine generator;
        normal_distribution<double> distribution(0.0,3);


       for (i=0;i<=nstars;i++)
       {   int number = distribution(generator);
           printf("%d\n\n",number);
        }


        return 0;
    }

One thing I would like to add that I don't know C++ and I just wrote this program by following other codes which I saw in other sites. Thanks.

Upvotes: 0

Views: 6168

Answers (1)

Robert Crovella
Robert Crovella

Reputation: 152259

Here's an adaptation of this code which will produce an approximately "normally"-distributed set of random numbers that can take on discrete values between approximately 0 and 8. I don't understand the request in the comments to have a range of 0 to 8 with a mean of 0.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand_kernel.h>
#include <math.h>
#define SCALE 2.0
#define SHIFT 4.5
#define DISCRETE
#define BLOCKS 1024
#define THREADS 512

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets different seed, a different sequence
       number, no offset */
    curand_init(7+id, id, 0, &state[id]);
}



__global__ void generate_normal_kernel(curandState *state,
                                int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    float x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int n = 0; n < 10; n++) {
        x = (curand_normal(&localState) * SCALE)+SHIFT;
        /* Discretize */
#if defined DISCRETE
        x = truncf(x);
#endif
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store last generated result per thread */
    result[id] = (int) x;
}


int main(int argc, char *argv[])
{
    int i;
    unsigned int total;
    curandState *devStates;
    int *devResults, *hostResults;
    int device;
    struct cudaDeviceProp properties;

    CUDA_CALL(cudaGetDevice(&device));
    CUDA_CALL(cudaGetDeviceProperties(&properties,device));


    /* Allocate space for results on host */
    hostResults = (int *)calloc(THREADS * BLOCKS, sizeof(int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, BLOCKS * THREADS *
              sizeof(int)));
    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, THREADS * BLOCKS *
              sizeof(int)));

    /* Allocate space for prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devStates, THREADS * BLOCKS *
                  sizeof(curandState)));

    /* Setup prng states */
    setup_kernel<<<BLOCKS, THREADS>>>(devStates);


    /* Generate and use uniform pseudo-random  */
    generate_normal_kernel<<<BLOCKS, THREADS>>>(devStates, devResults);

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, BLOCKS * THREADS *
        sizeof(int), cudaMemcpyDeviceToHost));

    /* Show result */
    if (THREADS*BLOCKS > 20){
      printf("First 20 stored results:\n");
      for (i=0; i<20; i++)
        printf("%d\n", hostResults[i]);
      }

    total = 0;
    for(i = 0; i < BLOCKS * THREADS; i++) {
        total += hostResults[i];
    }
    printf("Results mean = %f\n", (total/(1.0*BLOCKS*THREADS)));



    /* Cleanup */
    CUDA_CALL(cudaFree(devStates));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    return EXIT_SUCCESS;
}

You can easily modify this code to produce a continuous-valued normal distribution (of floats) also.

The two parameters of a normal distribution are mean and standard deviation. These are represented using the SHIFT and SCALE parameters. SHIFT moves the mean from zero. SCALE modifies the standard deviation (from 1.0, to whatever SCALE indicates). So you can play wit the SHIFT and SCALE parameters to get the distribution you want. Note that truncation of the real-valued output of the random number generator affects the statistics. You can adjust for this by adjusting SCALE or SHIFT, or you can switch from the truncf(), to some flavor of rounding.

You can compile this with:

nvcc -arch=sm_20 -o uniform uniform.cu

assuming you have a cc2.0 or higher GPU.

If not, it's ok to compile with:

nvcc -o uniform uniform.cu

The compiler warning that double is being demoted to float in this case is OK to ignore.

THREADS and BLOCKS are arbitrary choices within the limits of the machine. You can modify these to suit your particular launch configuration of your own code.

Upvotes: 4

Related Questions