Reputation: 57
I'm trying to convert C++ code into CUDA code, and I have the following triple-nested for loop that fills an array for later OpenGL rendering (I'm simply creating an array of vertex coordinates):
for(int z=0;z<263;z++) {
    for(int y=0;y<170;y++) {
        for(int x=0;x<170;x++) {
            g_vertex_buffer_data_3[i]=(float)x+0.5f;
            g_vertex_buffer_data_3[i+1]=(float)y+0.5f;
            g_vertex_buffer_data_3[i+2]=-(float)z+0.5f;
            i+=3;
        }
    }
}
I would like these operations to run faster, so I plan to use CUDA for some of them, like the one listed above. I want to create one block for each iteration of the outermost loop and, since the two inner loops together perform 170 * 170 = 28900 iterations, assign one thread to each innermost-loop iteration. I converted the C++ code into this (it's just a small program I wrote to understand how to use CUDA):
__global__ void mykernel(int k, float *buffer) {
    int idz=blockIdx.x;
    int idx=threadIdx.x;
    int idy=threadIdx.y;
    buffer[k]=idx+0.5;
    buffer[k+1]=idy+0.5;
    buffer[k+2]=idz+0.5;
    k+=3;
}
int main(void) {
    int dim=3*170*170*263;
    float* g_vertex_buffer_data_2 = new float[dim];
    float* g_vertex_buffer_data_3;
    int i=0;
    HANDLE_ERROR(cudaMalloc((void**)&g_vertex_buffer_data_3, sizeof(float)*dim));
    dim3 dimBlock(170, 170);
    dim3 dimGrid(263);
    mykernel<<<dimGrid, dimBlock>>>(i, g_vertex_buffer_data_3);
    HANDLE_ERROR(cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,sizeof(float)*dim,cudaMemcpyDeviceToHost));
    for(int j=0;j<100;j++){
        printf("g_vertex_buffer_data_2[%d]=%f\n",j,g_vertex_buffer_data_2[j]);
    }
    cudaFree(g_vertex_buffer_data_3);
    return 0;
}
When I try to run it, I get a segmentation fault. Do you know what I am doing wrong? I think the problem is that threadIdx.x and threadIdx.y grow at the same time, while I would like threadIdx.x to be the inner index and threadIdx.y to be the outer one.
Upvotes: 0
Views: 492
Reputation: 72350
There is a lot wrong here, but the source of the segfault is this:
cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
You either want
cudaMemcpy(&g_vertex_buffer_data_2[0],g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
or
cudaMemcpy(g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
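To illustrate why the original call crashes: g_vertex_buffer_data_2 is a float* holding the address of the heap array, whereas &g_vertex_buffer_data_2 is a float**, the address of the pointer variable itself on the stack. Copying sizeof(float)*dim bytes to that address runs far past the variable and corrupts the stack. The following is a minimal sketch of a correct device-to-host copy; the helper name copy_device_to_host is hypothetical and not part of the original code, and it uses std::vector instead of new[] purely for convenience.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper (not from the original post): copies n floats from
// device memory at d_src into a freshly allocated host vector.
std::vector<float> copy_device_to_host(const float* d_src, std::size_t n) {
    std::vector<float> h_dst(n);

    // h_dst.data() is a float* to the first element of the host storage,
    // which is what cudaMemcpy expects as its destination.
    // For a raw pointer `float* p = new float[n];` the equivalent is `p`
    // or `&p[0]`. `&p` would be the address of the pointer variable itself.
    cudaError_t err = cudaMemcpy(h_dst.data(), d_src, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        h_dst.clear();  // signal failure with an empty vector
    }
    return h_dst;
}
```

An empty return value signals that the copy failed; a caller would check for that before reading the data.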
Once you fix that, you will notice that the kernel never actually launches: the launch fails with an invalid launch configuration error. This is because a block size of (170,170), i.e. 28900 threads, is illegal. CUDA has a limit of 1024 threads per block on all current hardware.
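One way to stay within the per-block limit is to launch a one-dimensional grid of small blocks and let each thread recover its (x, y, z) from a flat index. The sketch below is only one possible layout, not the only valid one. The names fill_vertices and build_vertices and the block size of 256 are assumptions chosen for illustration. It reproduces the values of the original C++ loop, including the negated z. Each thread computes its output offset from its own index, because a kernel argument such as k is a private per-thread copy and incrementing it has no effect on other threads.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative kernel: one thread per vertex. The flat index tid plays the
// role of i/3 in the original triple loop (x fastest, then y, then z).
__global__ void fill_vertices(float* buffer, int nx, int ny, int nz) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nx * ny * nz) return;  // the last block may be partly idle

    int x = tid % nx;
    int y = (tid / nx) % ny;
    int z = tid / (nx * ny);

    buffer[3 * tid]     = (float)x + 0.5f;
    buffer[3 * tid + 1] = (float)y + 0.5f;
    buffer[3 * tid + 2] = -(float)z + 0.5f;
}

// Illustrative host wrapper: returns the filled vertex array,
// or an empty vector if any CUDA call fails.
std::vector<float> build_vertices(int nx, int ny, int nz) {
    const int total = nx * ny * nz;
    std::vector<float> host(3 * static_cast<std::size_t>(total));

    float* dev = nullptr;
    if (cudaMalloc((void**)&dev, host.size() * sizeof(float)) != cudaSuccess) {
        return {};
    }

    const int threads = 256;                              // well below 1024
    const int blocks = (total + threads - 1) / threads;   // round up
    fill_vertices<<<blocks, threads>>>(dev, nx, ny, nz);

    // A bad launch configuration is reported here, not by the launch itself.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);

    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return {};
    }
    return host;
}
```

With the sizes from the question, build_vertices(170, 170, 263) would launch 29691 blocks of 256 threads to cover the 7600700 vertices.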
There might well be other problems in your code. I stopped looking after I found these two.
Upvotes: 4