Fly_back

Reputation: 269

Error with 'cuda-memcheck' in cuda 8.0

It is strange that when I run ./main without cuda-memcheck, the program finishes without any warning or error message; however, when I prefix the command with cuda-memcheck, I get error messages like the following.

========= Invalid __global__ write of size 8
=========     at 0x00000120 in initCurand(curandStateXORWOW*, unsigned long)
=========     by thread (9,0,0) in block (3,0,0)
=========     Address 0x5005413b0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204115]
=========     Host Frame:./main [0x18e11]
=========     Host Frame:./main [0x369b3]
=========     Host Frame:./main [0x3403]
=========     Host Frame:./main [0x308c]
=========     Host Frame:./main [0x30b7]
=========     Host Frame:./main [0x2ebb]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]

Here are my functions, with a brief introduction to the code: I try to generate random numbers and save them to the device variable weights, then use this vector to sample from discrete numbers.

#include<iostream>
#include<curand.h>
#include<curand_kernel.h>
#include<time.h>

using namespace std;

#define num 100


__device__ float weights[num];

// function to define seed
__global__ void initCurand(curandState *state, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &state[idx]);
}


__device__ void sampling(float *weight, float max_weight, int *index, curandState *state){
    int j;
    float u;    
    do{
        j = (int)(curand_uniform(state) * (num + 0.999999)); 
        u = curand_uniform(state); //sample from uniform distribution;
    }while( u > weight[j]/max_weight);
    *index  = j;
}

__global__ void test(int *dev_sample, curandState *state){
    int idx     = threadIdx.x + blockIdx.x * blockDim.x;
    // generate random numbers from uniform distribution and save them to weights
    weights[idx]    = curand_uniform(&state[idx]);
    // run sampling function, in which, weights is an input for the function on each thread
    sampling(weights, 1, dev_sample+idx, &state[idx]);
}


int main(){ 
    // define the seed of random generator
    curandState *devState;  
    cudaMalloc((void**)&devState, num*sizeof(curandState));

    int *h_sample;
    h_sample    = (int*) malloc(num*sizeof(int));

    int *d_sample;
    cudaMalloc((void**)&d_sample, num*sizeof(float));

    initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);
    test<<<(int)num/32 + 1, 32>>>(d_sample, devState);

    cudaMemcpy(h_sample, d_sample, num*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < num; ++i)
    {
        cout << *(h_sample + i) << endl;
    }

    //free memory
    cudaFree(devState);
    free(h_sample);
    cudaFree(d_sample);
    return 0;
}

I have just started to learn CUDA; if the way I access global memory is incorrect, please help me with that. Thanks.

Upvotes: 0

Views: 668

Answers (1)

Robert Crovella

Reputation: 152143

This is launching "extra" threads:

initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);

num is 100, so the above config will launch 4 blocks of 32 threads each, i.e. 128 threads. But you are only allocating space for 100 curandState here:

cudaMalloc((void**)&devState, num*sizeof(curandState));

So your initCurand kernel will have some threads (idx = 100-127) that are attempting to initialize curandState objects that you haven't allocated. As a result, when you run cuda-memcheck, which does fairly rigorous out-of-bounds checking, an error is reported.

One possible solution would be to modify your initCurand kernel as follows:

__global__ void initCurand(curandState *state, unsigned long seed, int num){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < num)
        curand_init(seed, idx, 0, &state[idx]);
}

This will prevent any out-of-bounds threads from doing anything. Note that you will need to modify the kernel call to pass num to it. It also appears that you have a similar problem in your test kernel, so you may want to apply the same fix there, as sketched below. This is a common construct in CUDA kernels; I call it a "thread check". You can find other questions on the SO cuda tag discussing this same concept.
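For completeness, here is a minimal, untested sketch of the thread check applied to both kernels in the posted code. Because num is a #define there, the check can use the macro directly and the launch lines can stay unchanged; if you prefer passing num as a kernel parameter as shown above, rename or remove the macro first so the parameter name doesn't clash with the preprocessor.

// Sketch (untested): "thread check" in both kernels, using the existing #define num
__global__ void initCurand(curandState *state, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < num)                      // skip out-of-range threads
        curand_init(seed, idx, 0, &state[idx]);
}

__global__ void test(int *dev_sample, curandState *state){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < num){                     // skip out-of-range threads
        weights[idx] = curand_uniform(&state[idx]);
        sampling(weights, 1, dev_sample+idx, &state[idx]);
    }
}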

Upvotes: 2
