Reputation: 3003
I'm trying to get my head around CUDA. After not grasping similar Stack Overflow questions, I decided to test out an example (I'm using CUDAfy.NET for C#, but the underlying CUDA should be parsable).
I want to do the following. Send a 4x4x4 matrix to the kernel and get a 4x4x4 out according to this logic:
if(input[x,y,z] == 1)
    output[x+1, y, z]++;
if(input[x,y,z] == 2)
    output[x-1, y, z]++;
I studied the following cudafy example.
public const int N = 1 * 1024;
//Omissions
gpu.Launch(128, 1, function, dev_a, dev_b, dev_c);
kernel:
[Cudafy]
public static void add_0(GThread thread, int[] a, int[] b, int[] c)
{
    int tid = thread.blockIdx.x; // (tid 0 -> 127, from my understanding)
    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += thread.gridDim.x;
    }
}
I then tried to transfer it to 3D, but I cannot get the indexing right. Say I have the following (three arrays here, just to test indexing):
int size = 4;
int[] dev_delta = gpu.Allocate<int>(size * size * size);
int[] dev_space = gpu.Allocate<int>(size * size * size);
int[] dev_result = gpu.Allocate<int>(size * size * size);
gpu.Launch(new dim3(4, 4, 4), 1, "testIndex", dev_delta, dev_space, dev_result);
And the kernel:
[Cudafy]
public static void testIndex(GThread thread, int[] delta, int[] space, int[] result)
{
    int x = thread.blockIdx.x;
    int y = thread.blockIdx.y;
    int z = thread.blockIdx.z;
    delta[x]++;
    space[y]++;
    result[z]++;
}
Naively I'd expect the following:
delta = {4,4,4,4,0,0,0,0,0, ... 0,0}
space = {4,4,4,4,0,0,0,0,0, ... 0,0}
result = {4,4,4,4,0,0,0,0,0 ... 0,0}
But I get:
delta = {1,1,1,1,0,0,0,0,0, ... 0,0}
space = {1,1,1,1,0,0,0,0,0, ... 0,0}
result = {1,0,0,0,0,0,0,0,0 ... 0,0}
This makes no sense to me; clearly I am missing something.
Questions:
How many threads am I starting?
How do you go about 'indexing' my example problem in 3 dimensions (Starting 4x4x4 threads and getting the variables for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])?
How do you go about 'indexing' my example problem in 2 dimensions? (Starting 4x4 threads and then let each thread handle a depth column of length 4)
I found this, which may be relevant: "Why is z always zero in CUDA kernel". If that is what is messing me up, I'd still appreciate pure-CUDA answers to sort my brain out.
Upvotes: 1
Views: 323
Reputation: 2916
How many threads am I starting? You are starting 1 thread per block, hence 16 in total (4 x 4 blocks), since the Z parameter of the grid is not used. For better performance, I would also recommend using more threads per block (at least 128, and a multiple of 32 in any case).
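To make the arithmetic above concrete, here is a minimal sketch in plain C++ (not CUDAfy.NET code). `Dim3`, `volume` and `totalThreads` are hypothetical names introduced only for illustration; `Dim3` stands in for CUDA's `dim3` so the snippet compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3, so the sketch compiles without the CUDA toolkit.
struct Dim3 { unsigned x, y, z; };

// Number of blocks (or threads) described by one dim3-style triple.
std::size_t volume(Dim3 d) {
    assert(d.x > 0 && d.y > 0 && d.z > 0);  // every launch dimension must be at least 1
    return static_cast<std::size_t>(d.x) * d.y * d.z;
}

// Total threads in a launch = blocks in the grid * threads per block.
std::size_t totalThreads(Dim3 grid, Dim3 block) {
    return volume(grid) * volume(block);
}
```

With these helpers, `totalThreads(Dim3{4, 4, 4}, Dim3{1, 1, 1})` is the 64 threads the launch in the question asks for, while `totalThreads(Dim3{4, 4, 1}, Dim3{1, 1, 1})` is the 16 you get once the grid's Z dimension is left at one.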
How do you go about 'indexing' my example problem in 3 dimensions (Starting 4x4x4 threads and getting the variables for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])?
The second parameter of gpu.Launch is the number of threads per block. x, y and z could hence be threadIdx.x, threadIdx.y and threadIdx.z respectively. But you may also want to use many blocks, in which case threadIdx.x + blockDim.x * blockIdx.x could be a good pick.
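The two index formulas in this thread can be sketched as ordinary functions. This is plain C++ rather than kernel code, and `globalIndex` and `flatIndex` are hypothetical helper names; inside a CUDA kernel the same expressions would be written directly with the built-in threadIdx, blockDim and blockIdx variables.

```cpp
#include <cassert>
#include <cstddef>

// Global 1D index of a thread: threadIdx.x + blockDim.x * blockIdx.x.
std::size_t globalIndex(std::size_t threadIdx, std::size_t blockDim, std::size_t blockIdx) {
    assert(threadIdx < blockDim);  // a thread index is always below the block size
    return threadIdx + blockDim * blockIdx;
}

// Flat offset of element (x, y, z) in a 3D array stored with z varying fastest,
// i.e. the layout from the question: x * sizeY * sizeZ + y * sizeZ + z.
std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t z,
                      std::size_t sizeY, std::size_t sizeZ) {
    assert(y < sizeY && z < sizeZ);  // out-of-range y or z would alias another element
    return x * sizeY * sizeZ + y * sizeZ + z;
}
```

For the 4x4x4 case, `flatIndex(1, 2, 3, 4, 4)` is 1*16 + 2*4 + 3 = 27, and the last element `flatIndex(3, 3, 3, 4, 4)` is 63, so every (x, y, z) maps to a distinct slot of the 64-element buffer.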
The link you provided explains why your Z dimension is not relevant. CUDAfy.NET exposes a launch function that in turn calls the native CUDA C API. When passing parameters from .NET to the native environment, it seems that CUDAfy.NET simply ignores the Z argument of the grid, leaving it at one (most probably because early versions of CUDA did not support a grid Z dimension other than one). This explanation is not pure CUDA: CUDA itself now supports Z values other than one, but your parameter is simply ignored in the CUDAfy.NET implementation.
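Since the grid's Z value is dropped, one way around it is the 2D variant from the question: launch 4x4 and let each thread walk its own depth column. Below is a CPU-side sketch of that idea in plain C++, not CUDAfy.NET code. `flatten3`, `processColumn` and `runGrid` are hypothetical names, and skipping writes that would fall outside the array is an assumption, since the question does not say what should happen at the borders.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat offset of (x, y, z) in a size*size*size cube, z varying fastest.
std::size_t flatten3(std::size_t x, std::size_t y, std::size_t z, std::size_t size) {
    return x * size * size + y * size + z;
}

// Work of ONE thread in a size x size launch: thread (x, y) walks its depth column.
// Assumption: writes that would land outside the cube are skipped.
void processColumn(const std::vector<int>& input, std::vector<int>& output,
                   std::size_t x, std::size_t y, std::size_t size) {
    for (std::size_t z = 0; z < size; ++z) {
        int v = input[flatten3(x, y, z, size)];
        if (v == 1 && x + 1 < size) ++output[flatten3(x + 1, y, z, size)];
        if (v == 2 && x >= 1)       ++output[flatten3(x - 1, y, z, size)];
    }
}

// CPU stand-in for the launch: the two loops play the role of blockIdx.x and blockIdx.y.
std::vector<int> runGrid(const std::vector<int>& input, std::size_t size) {
    assert(input.size() == size * size * size);
    std::vector<int> output(input.size(), 0);
    for (std::size_t x = 0; x < size; ++x)
        for (std::size_t y = 0; y < size; ++y)
            processColumn(input, output, x, y, size);
    return output;
}
```

On a real GPU the threads run concurrently, and two of them can target the same output cell (the thread at x-1 writing to x+1-1 and the thread at x+1 writing to x+1-1), so the plain increment would need to become an atomic one, for example CUDA's atomicAdd. The sequential loops here hide that race.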
Upvotes: 2