CUDA kernel is not accessing all the elements of an array

Question

I have written a CUDA program that performs some operations on a large array. However, when I pass the array to a CUDA kernel, not all of its elements are accessed by the threads. Below is a simple program illustrating my use case:

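#include <stdio.h>
#include <cuda_runtime.h>

__global__
void kernel(int n){
        int s = threadIdx.x + blockIdx.x*blockDim.x;  // global thread index
        int t = blockDim.x*gridDim.x;                 // total number of threads (stride)
        for(int i=s;i<n;i+=t){
                printf("%d\n", i);
        }
}

int main(void){
        int n = 10000;  // array size (assumed example value, above the 4096 mentioned below)
        int blockSize = 64;
        int numBlocks = (n + blockSize - 1) / blockSize;  // assumed launch configuration
        kernel<<<numBlocks, blockSize>>>(n);
        cudaDeviceSynchronize();
}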

I've tried different block sizes (blockSize = 256, 128, 64, etc.), but it never prints all the indices of the array. Ideally, it should print some permutation of 0 to n-1; however, it prints fewer than n numbers.



If numBlocks and blockSize are both 1, it accesses all the elements. It also accesses all the elements if the array size is less than 4096.

Ander Biguri · Accepted Answer

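Use better debugging techniques! Your code is functionally correct. Rather than relying on printf output from inside the kernel, have each thread write to an array, copy it back to the host, and check the result there: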

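#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

__global__
void kernel(int* in, int n){
    int s = threadIdx.x + blockIdx.x*blockDim.x;  // global thread index
    int t = blockDim.x*gridDim.x;                 // total number of threads (stride)
    for (int i = s; i < n; i += t){
        in[i] = 1;  // mark element i as visited (assumed value, so the sum below equals n)
    }
}

int main(void){
    int n = 10000;  // array size (assumed example value)
    int blockSize = 64;
    int numBlocks = (n + blockSize - 1) / blockSize;  // assumed launch configuration
    int *d_res, *h_res;
    cudaMalloc(&d_res, n*sizeof(int));
    h_res = (int*)malloc(n*sizeof(int));

    kernel<<<numBlocks, blockSize>>>(d_res, n);
    cudaDeviceSynchronize();
    cudaMemcpy(h_res, d_res, n*sizeof(int), cudaMemcpyDeviceToHost);

    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += h_res[i];
    printf("%d", sum);  // equals n if every element was written

    cudaFree(d_res);
    free(h_res);
}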
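
As a further sanity check that needs neither a GPU nor any printed output, the index arithmetic of the loop can be simulated on the host. This is only a sketch under stated assumptions: `count_visits` and `all_visited_once` are hypothetical helper names that are not part of the original code, and the `numBlocks` and `blockSize` parameters stand in for `gridDim.x` and `blockDim.x`.

```cpp
#include <cassert>
#include <vector>

// Host-side simulation of the kernel's loop: for every (block, thread) pair,
// walk the same indices the device thread would visit and count the visits.
// numBlocks and blockSize stand in for gridDim.x and blockDim.x (assumption).
std::vector<int> count_visits(int n, int numBlocks, int blockSize) {
    assert(n >= 0 && numBlocks > 0 && blockSize > 0);
    std::vector<int> visits(n, 0);
    const int t = blockSize * numBlocks;              // total threads = stride
    for (int block = 0; block < numBlocks; ++block) {
        for (int thread = 0; thread < blockSize; ++thread) {
            const int s = thread + block * blockSize; // global thread index
            for (int i = s; i < n; i += t) {
                ++visits[i];
            }
        }
    }
    return visits;
}

// True if every index in [0, n) is visited exactly once.
bool all_visited_once(int n, int numBlocks, int blockSize) {
    for (int v : count_visits(n, numBlocks, blockSize)) {
        if (v != 1) return false;
    }
    return true;
}
```

For example, `all_visited_once(10000, 157, 64)` returns true, and so does the single-thread configuration `all_visited_once(10000, 1, 1)`. A check like this covers only the indexing logic. It says nothing about how output from the device reaches the screen, which is why the array-and-sum test above is still worth running on the GPU.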
