Reputation: 11
I have a 3D image with dimensions 512*512*512 and I have to process every voxel individually. However, I can't find launch dimensions that give me the x, y and z values of each voxel.
In my kernel I have:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
I am running the program by using:
Kernel<<<dim3(8,8), dim3(8,8,16)>>>();
I chose those because 64 blocks with 1024 threads each should give me every pixel. However, how do I get the coordinate values when I have those dimensions?
When calling the kernel I have to set the dimensions so that the x, y and z values actually run from 0 to 511, which would give me the position of every pixel. But with every combination I try, the kernel either does not run, or it runs but the values don't get high enough.
The program should give every thread an (x,y,z) that corresponds to one pixel in the image. As the simplest possible test, I am just trying to print the coordinates to see whether all of them appear.
Any help?
EDIT:
The properties of my GPU:
Compute capability: 2.0
Name: GeForce GTX 480
My program code just to test it out:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
// Device code
__global__ void Kernel()
{
    // Here I should somehow get the x, y and z values for every pixel possible in the 512*512*512 image
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    printf("Coords: (%i, %i, %i)\n", x, y, z);
}
// Host code
int main(int argc, char** argv) {
    Kernel<<<dim3(8, 8), dim3(8,8,16)>>>(); //This invokes the kernel
    cudaDeviceSynchronize();
    return 0;
}
Upvotes: 1
Views: 1004
Reputation: 152113
To cover a 512x512x512 space with the indexing you have shown (i.e. one thread per voxel) you would need a kernel launch something like this:
Kernel<<<dim3(64,64,64), dim3(8,8,8)>>>();
When I multiply any of the dimensional components:
64*8
I get 512. This gives me a grid of 512 threads in each of 3 dimensions. Your indexing will work with this arrangement as-is to produce one unique thread per voxel.
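If the volume size is not hard-coded, the same arithmetic can be run the other way: divide each dimension by the block size, rounding up, and have the kernel ignore any threads that land outside the volume. Here is a minimal sketch of that idea, assuming a cc2.0+ device. The names `divUp` and `visitVoxels` are made up for illustration and are not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
static unsigned int divUp(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// One thread per voxel; threads that fall outside the volume exit early.
__global__ void visitVoxels(int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    // Print only the last voxel to keep the output small.
    if (x == nx - 1 && y == ny - 1 && z == nz - 1)
        printf("Last voxel: (%d, %d, %d)\n", x, y, z);
}

int main()
{
    const int nx = 512, ny = 512, nz = 512;  // volume size from the question
    dim3 block(8, 8, 8);                     // 512 threads per block
    dim3 grid(divUp(nx, block.x), divUp(ny, block.y), divUp(nz, block.z));

    visitVoxels<<<grid, block>>>(nx, ny, nz);

    // Check the launch itself first, then the kernel execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

For 512 and a block edge of 8 this yields the same 64x64x64 grid as above. The bounds check only matters when a dimension is not a multiple of the block edge.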
The above assumes a cc2.0 or higher device (your mention of 1024 threads per block suggests that you have a cc2.0+ device), which permits 3D grids. A cc1.x device supports only 2D grids, so there you will need to modify your indexing.
In that case, you can fold the 64 z-blocks into the y dimension of the grid (64*64 = 4096 blocks in y) and unpack them in the kernel, something like this:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = (blockIdx.y%64) * blockDim.y + threadIdx.y;
int z = (blockIdx.y/64) * blockDim.z + threadIdx.z;
along with a kernel launch like this:
Kernel<<<dim3(64,4096), dim3(8,8,8)>>>();
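To make the folded indexing concrete, here is a rough, self-contained sketch of the 2D-grid variant. It is an illustration under assumptions, not tested on cc1.x hardware: `unfoldBlockY`, `Kernel2D` and `BLOCKS_PER_DIM` are hypothetical names, and because in-kernel printf is not available on cc1.x, the kernel writes the last voxel's coordinates to global memory for the host to print.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCKS_PER_DIM 64  // 512 voxels / 8 threads per block in each dimension

// Hypothetical helper: split a folded blockIdx.y back into y and z block indices.
__host__ __device__ inline void unfoldBlockY(unsigned int folded,
                                             unsigned int *by, unsigned int *bz)
{
    *by = folded % BLOCKS_PER_DIM;
    *bz = folded / BLOCKS_PER_DIM;
}

// Stores the coordinates of the last voxel in global memory.
__global__ void Kernel2D(int *last)
{
    unsigned int by, bz;
    unfoldBlockY(blockIdx.y, &by, &bz);
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = by * blockDim.y + threadIdx.y;
    int z = bz * blockDim.z + threadIdx.z;
    if (x == 511 && y == 511 && z == 511) {
        last[0] = x; last[1] = y; last[2] = z;
    }
}

int main()
{
    int hLast[3] = {-1, -1, -1};
    int *dLast = NULL;
    cudaMalloc((void **)&dLast, sizeof(hLast));
    cudaMemcpy(dLast, hLast, sizeof(hLast), cudaMemcpyHostToDevice);

    // 64 blocks in x; 64*64 = 4096 blocks in y carry both y and z.
    Kernel2D<<<dim3(BLOCKS_PER_DIM, BLOCKS_PER_DIM * BLOCKS_PER_DIM),
               dim3(8, 8, 8)>>>(dLast);

    // Check the launch, then copy back (the copy waits for the kernel).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hLast, dLast, sizeof(hLast), cudaMemcpyDeviceToHost);
    cudaFree(dLast);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Last voxel: (%d, %d, %d)\n", hLast[0], hLast[1], hLast[2]);
    return 0;
}
```

The modulo and division by 64 are the inverse of packing the block indices as `by + 64*bz`, so each of the 4096 values of blockIdx.y maps to exactly one (y, z) block pair.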
Here's a fully worked example (cc2.0), based on the code you have now shown:
$ cat t604.cu
#include <stdio.h>
#include <stdlib.h>
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// Device code
__global__ void Kernel()
{
    // Here I should somehow get the x, y and z values for every pixel possible in the 512*512*512 image
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    // Print only the last voxel to keep the output manageable
    if ((x==511)&&(y==511)&&(z==511)) printf("Coords: (%i, %i, %i)\n", x, y, z);
}
// Host code
int main(int argc, char** argv) {
    cudaFree(0); // Forces CUDA context creation so setup errors surface here
    cudaCheckErrors("CUDA is not working correctly");
    Kernel<<<dim3(64, 64, 64), dim3(8,8,8)>>>(); //This invokes the kernel
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    return 0;
}
$ nvcc -arch=sm_20 -o t604 t604.cu
$ cuda-memcheck ./t604
========= CUDA-MEMCHECK
Coords: (511, 511, 511)
========= ERROR SUMMARY: 0 errors
$
Note that I have elected to print only one line. I did not want to wade through 512x512x512 lines of printout: it would take a very long time to run, and in-kernel printf is limited in output volume anyway.
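If you want more assurance than a single printed line, one option is to count the in-range threads on the device and compare the total with 512*512*512 on the host. The following is only a sketch of that idea: `countVoxels` and `voxelCount` are names I made up, and the check confirms how many threads ran inside the volume, not that every coordinate is distinct.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: total number of voxels in an nx*ny*nz volume.
static unsigned long long voxelCount(unsigned int nx, unsigned int ny, unsigned int nz)
{
    return (unsigned long long)nx * ny * nz;
}

// Every thread that maps to a voxel inside the volume bumps a shared counter.
__global__ void countVoxels(unsigned int *counter, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz)
        atomicAdd(counter, 1u);
}

int main()
{
    const int n = 512;  // volume edge length from the question
    unsigned int hCounter = 0;
    unsigned int *dCounter = NULL;
    cudaMalloc((void **)&dCounter, sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    countVoxels<<<dim3(64, 64, 64), dim3(8, 8, 8)>>>(dCounter, n, n, n);

    // Check the launch, then copy back (the copy waits for the kernel).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(&hCounter, dCounter, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    cudaFree(dCounter);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Visited %u of %llu voxels\n", hCounter, voxelCount(n, n, n));
    return hCounter == voxelCount(n, n, n) ? 0 : 1;
}
```

A 32-bit counter is enough here because 512*512*512 = 134217728 fits comfortably in an unsigned int. Funnelling every thread through one atomic is slow, so treat this as a one-off sanity check rather than something to leave in production code.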
Upvotes: 4