adry_b89

Reputation: 43

CUDA Array set to 0 after kernel call

I have a simple program with 3 arrays that counts how many times the third array is 0 while the first and second arrays hold the same value. When that condition is true, an index of another array is incremented. The problems are:

  1. If the kernel contains only the first if(), array A is always 0.

  2. If I add the else if branch, the values of array A are set to 0 after index 2, and the case where A, B, C = 0 is not counted.

This is the code:

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, float *b, float *c, float *res)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (a[idx]=b[idx] && c[idx]==0) {
        res[0]++;
    }
    else if (a[idx]=b[idx] && c[idx]==1){
        res[1]++;
    }

}

// main routine that executes on the host
int main(void)
{
    float  *a_h, *a_d;  // Pointer to host & device arrays
    float  *b_h, *b_d;  // Pointer to host & device arrays
    float  *c_h, *c_d;  // Pointer to host & device arrays
    float  *res_h, *res_d;  // Pointer to host & device arrays

    const int N = 10;  // Number of elements in arrays
    size_t size = N * sizeof(float);
    //size_t size_s = 4 * sizeof(float);
    a_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &a_d, size);   // Allocate array on device
    b_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &b_d, size);   // Allocate array on device
    c_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &c_d, size);   // Allocate array on device
    res_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &res_d, size);   // Allocate array on device

    // Initialize host array and copy it to CUDA device
    //  for (int i=0; i<N; i++) a_h[i] = (float)i;
    for (int i=0; i<N; i++) a_h[i] = (float)i;
    for (int i=0; i<N; i++) b_h[i] = (float)i;
    for (int i=0; i<N; i++) c_h[i] = (float)i;
    for (int i=0; i<4; i++) res_h[i] = 0;


    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(c_d, c_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(res_d, res_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 8;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
    square_array <<< n_blocks, block_size >>> (a_d, b_d, c_d, res_d);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    cudaMemcpy(b_h, b_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    cudaMemcpy(c_h, c_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    cudaMemcpy(res_h, res_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i=0; i<N; i++){
        printf("%f A \n", a_h[i]);
    }
    for (int i=0; i<N; i++){
            printf("%f B \n", b_h[i]);
        }
    for (int i=0; i<N; i++){
            printf("%f C \n", c_h[i]);
        }
    for (int i=0; i<4; i++){
        printf("%f res \n", res_h[i]);
    }

    // Cleanup
    free(a_h); cudaFree(a_d);
    free(b_h); cudaFree(b_d);
    free(c_h); cudaFree(c_d);
    free(res_h); cudaFree(res_d);
}

Upvotes: 0

Views: 675

Answers (2)

Gilles

Reputation: 9489

Aside from the = in if (a[idx]=b[idx] && c[idx]==0) { that should be ==, as you already found (and the same goes for the following if statement), there are at least two other issues in your code:

  1. You don't check that the thread index stays within the bounds of the arrays. Since you are using 2 blocks of 8 threads, you have 16 threads accessing 10-element arrays. To avoid the issue, you need to pass N as a parameter to your kernel and add an if ( idx < N ) guard.

  2. You accumulate into res in parallel without any sort of protection, leading to all kinds of race conditions. This is a very typical histogram issue that is explained aplenty in the literature (web, books, CUDA samples...). A quick fix (albeit probably not the most efficient one) would be to use atomic operations, such as atomicAdd. In your case, the line res[0]++; would become atomicAdd( &res[0], 1 );, and res[1]++; would become (as you guessed) atomicAdd( &res[1], 1 );. Note that atomicAdd on float requires compiling for compute capability 2.0 or higher.
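Putting the three fixes together (== instead of =, a bounds check, and atomic increments), the kernel could look something like this. This is an untested sketch; note the extra N parameter, which means the launch in main must be updated to pass it as well:

```cuda
// Kernel that executes on the CUDA device.
// Counts, across all valid indices, how often a[idx] == b[idx]
// with c[idx] == 0 (res[0]) or c[idx] == 1 (res[1]).
__global__ void square_array(float *a, float *b, float *c, float *res, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {                        // guard against out-of-range threads
        if (a[idx] == b[idx] && c[idx] == 0) {
            atomicAdd(&res[0], 1.0f);     // race-free increment (float atomicAdd needs sm_20+)
        }
        else if (a[idx] == b[idx] && c[idx] == 1) {
            atomicAdd(&res[1], 1.0f);
        }
    }
}
```

and the corresponding launch in main would become square_array <<< n_blocks, block_size >>> (a_d, b_d, c_d, res_d, N);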

HTH

Upvotes: 1

adry_b89

Reputation: 43

Sorry, I solved the problem. It was a typo in the condition: I wrote = instead of ==.

Upvotes: 0
