Make CURAND generate different random numbers from a uniform distribution

Question

I am trying to use the CURAND library to generate random numbers from 0 to 100 that are completely independent of each other. To do this, I give the time as the seed to each thread and pass "id = threadIdx.x + blockDim.x * blockIdx.x" as both the sequence and the offset. After getting the random number as a float, I multiply it by 100 and take its integer value.

Now, the problem I am facing is that threads [0,0] and [0,1] get the same random number, 11, no matter how many times I run the code. I am unable to understand what I am doing wrong. Please help.

I am pasting my code below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <curand_kernel.h>
#include "util/cuPrintf.cu"

#define NE WA*HA //Total number of random numbers 
#define WA 2   // Matrix A width
#define HA 2   // Matrix A height
#define SAMPLE 100 //Sample number
#define BLOCK_SIZE 2 //Block size

__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
int id = threadIdx.x  + blockIdx.x + blockDim.x;
curand_init ( seed, id , id, &state[id] );
}

__global__ void generate( curandState* globalState, float* randomMatrix )
{
int ind = threadIdx.x + blockIdx.x * blockDim.x;
if(ind < NE){
    curandState localState = globalState[ind];
    float stopId = curand_uniform(&localState) * SAMPLE;
    cuPrintf("Float random value is : %f",stopId);
    int stop = stopId ;
    cuPrintf("Random number %d\n",stop);
    for(int i = 0; i < SAMPLE; i++){
            if(i == stop){
                    float random = curand_normal( &localState );
                    cuPrintf("Random Value %f\t",random);
                    randomMatrix[ind] = random;
                    break;
            }
    }
    globalState[ind] = localState;
}
}

/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////

int main(int argc, char** argv)
{

// 1. allocate host memory for matrix A
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float* ) malloc(mem_size_A);
time_t t;

// 2. allocate device memory
float* d_A;
cudaMalloc((void**) &d_A, mem_size_A);

// 3. create random states    
curandState* devStates;
cudaMalloc ( &devStates, size_A*sizeof( curandState ) );

// 4. setup seeds
int n_blocks = size_A/BLOCK_SIZE;
time(&t);
printf("\nTime is : %u\n",(unsigned long) t);
setup_kernel <<< n_blocks, BLOCK_SIZE >>> ( devStates, (unsigned long) t );
// 4. generate random numbers
cudaPrintfInit();
generate <<< n_blocks, BLOCK_SIZE >>> ( devStates,d_A );
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
// 5. copy result from device to host
cudaMemcpy(h_A, d_A, mem_size_A, cudaMemcpyDeviceToHost);


// 6. print out the results
printf("\n\nMatrix A (Results)\n");
for(int i = 0; i < size_A; i++)
{
   printf("%f ", h_A[i]);
   if(((i + 1) % WA) == 0)
      printf("\n");
}
printf("\n");

// 7. clean up memory
free(h_A);
cudaFree(d_A);

}

Output that I get is :

Time is : 1347857063
[0, 0]: Float random value is : 11.675105[0, 0]: Random number 11
[0, 0]: Random Value 0.358356 [0, 1]: Float random value is : 11.675105[0, 1]: Random number 11
[0, 1]: Random Value 0.358356 [1, 0]: Float random value is : 63.840496[1, 0]: Random number 63
[1, 0]: Random Value 0.696459 [1, 1]: Float random value is : 44.712799[1, 1]: Random number 44
[1, 1]: Random Value 0.735049

Tom · Accepted Answer

There are a few things wrong here. I'm addressing the first ones to get you started:

General points

Please check the return values of all CUDA API calls.
Please run cuda-memcheck to check for obvious things like out-of-bounds accesses.

Specific points

When allocating space for the RNG state, you should have space for one state per thread (not one per matrix element as you have now).
Your thread ID calculation in setup_kernel() is wrong; it should be threadIdx.x + blockIdx.x * blockDim.x (* instead of +).
You use the thread ID as the sequence number as well as the offset. You should just set the offset to zero, as described in the cuRAND manual:

For the highest quality parallel pseudorandom number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number.

Finally, you're running two threads per block, which is incredibly inefficient. Check out the "maximize utilization" section of the CUDA C Programming Guide for more information, but you should be looking to launch a multiple of 32 threads per block (e.g. 128, 256) and a large number of blocks (e.g. tens of thousands). If your problem is small, then consider running multiple problems at once (either batched in a single kernel launch or as kernels in different streams to get concurrent execution).
