Cokes
Cokes

Reputation: 4123

Generating Uniform Double random numbers on device in CUDA

I would like to generate uniform random numbers on the device, to be used inside of a device function. Each thread should generate a different uniform random number. I have this code, but I get a segmentation fault.

int main{
  curandStateMtgp32 *devMTGPStates;
  mtgp32_kernel_params *devKernelParams;

  cudaMalloc((void **)&devMTGPStates, NUM_THREADS*NUM_BLOCKS * sizeof(curandStateMtgp32));
  cudaMalloc((void**)&devKernelParams,sizeof(mtgp32_kernel_params));

  curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
  curandMakeMTGP32KernelState(devMTGPStates,
    mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);

  doHenry <<NUM_BLOCKS,NUM_THREADS>>> (devMTGPStates);
}

Inside my global function doHenry, evaluated on the device, I put:

double rand1 = curand_uniform_double(&state[threadIdx.x+NUM_THREADS*blockIdx.x]);

Is this the best way to generate a random number per thread? I don't understand what the devKernelParams is doing, but I know I need one state per thread, right?

Upvotes: 1

Views: 1196

Answers (1)

Robert Crovella
Robert Crovella

Reputation: 152259

I think you're getting the seg fault on this line:

curandMakeMTGP32KernelState(devMTGPStates,  mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);

I believe the reason for the seg fault is because you have exceeded 200 for the n parameter, for which you are passing NUM_BLOCKS*NUM_THREADS. I tried a version of your code, and I was able to reproduce the seg fault at around n=540.

The MT generator has a limitation on the amount of states it can set up when using pre-generated kernel parameters (mtgp32dc_params_fast_11213). You may wish to read the relevant section of the documentation. (Bit Generation with the MTGP32 generator)

I'm not really an expert on CURAND, but other generators (such as XORWOW) don't have this type of limitation, so if you want to generate a large amount of independent thread state easily, consider one of the other generators. Using the particular approach you have outlined, the MTGP32 generator seems to be limited to about 200*256 independent thread generation. Contrary to what I said in the comments (which is true for other generator types) the MTGP32 state seems to be sufficient at one state for a block of up to 256 threads. And the example given in the documentation (refer to the second example) uses that type of state generation and threadblock hierarchy.

Upvotes: 2

Related Questions