Reputation: 158
I have written a cuda program to do some operation on large array. But when I pass that array to a cuda kernel, then all of its elements are not accessed by threads. Below, there is a simple program explaining my use case:
#include <stdio.h>
#include <stdlib.h>
__global__
void kernel(int n){
int s = threadIdx.x + blockIdx.x*blockDim.x;
int t = blockDim.x*gridDim.x;
for(int i=s;i<n;i+=t){
printf("%d\n",i); //printing index of array which is being accessed
}
}
int main(void){
int i,n = 10000; //array_size
int blockSize = 64;
int numBlocks = (n + blockSize - 1) / blockSize;
kernel<<<numBlocks, blockSize>>>(n);
cudaDeviceSynchronize();
}
I've tried with different blockSize = 256, 128, 64, etc
, It is not printing all the indices of array. Ideally, it should print any permutation of 0 to n-1
, however it is printing lesser(<n)
numbers.
If numBlocks
and blockSize
both are 1, then it is accessing all the element. And if array size is less than 4096, then also it is accessing all the elements.
Upvotes: 3
Views: 977
Reputation: 35525
Use better debugging techniques! Your code is properly functional
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
__global__
void kernel(int* in, int n){
int s = threadIdx.x + blockIdx.x*blockDim.x;
int t = blockDim.x*gridDim.x;
for (int i = s; i<n; i += t){
in[i] = 1; //printing index of array which is being accessed
}
}
int main(void){
int i, n = 10000; //array_size
int blockSize = 64;
int numBlocks = (n + blockSize - 1) / blockSize;
int* d_res,*h_res;
cudaMalloc(&d_res, n*sizeof(int));
h_res = (int*)malloc(n*sizeof(int));
kernel << <numBlocks, blockSize >> >(d_res, n);
cudaDeviceSynchronize();
cudaMemcpy(h_res, d_res, n*sizeof(int), cudaMemcpyDeviceToHost);
int sum = 0;
for (int i = 0; i < n; i++)
sum += h_res[i];
printf("%d", sum);
}
Upvotes: 2
Reputation: 16796
Actually, all of the values are being printed in the current case. but you may not be able to see all of them due to buffer limit of the output console. Try increasing the output console's buffer size.
Additionally, keep in mind that the printf
calls inside the kernel execute out-of-order. Also, there are limitations of the printf
buffer on the device which are explained in the documentation.
Upvotes: 3