Adam

Reputation: 3003

CUDA/CUDAfy 3D indexing

I'm trying to get my head around CUDA. After failing to grasp similar Stack Overflow questions, I decided to test out an example (I'm using CUDAfy.NET for C#, but the underlying CUDA should still be readable).

I want to do the following: send a 4x4x4 matrix to the kernel and get a 4x4x4 matrix back according to this logic:

if(input[x,y,z] == 1)
    output[x+1, y, z]++;

if(input[x,y,z] == 2)
    output[x-1, y, z]++;

I studied the following cudafy example.

public const int N = 1 * 1024;

//Omissions

gpu.Launch(128, 1, function, dev_a, dev_b, dev_c);

kernel:

[Cudafy]
public static void add_0(GThread thread, int[] a, int[] b, int[] c)
{
    int tid = thread.blockIdx.x; // (tid 0 -> 127, from my understanding)
    while (tid < N)
    { 
        c[tid] = a[tid] + b[tid];
        tid += thread.gridDim.x;
    }
}
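
To make the stride pattern in add_0 concrete, here is a small CPU-only sketch in plain C++ (not CUDAfy or CUDA code). It assumes one thread per block, as in the launch above, and records which block would handle each element. The name `ownerBlock` is made up for illustration.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of the grid-stride loop in add_0 (illustrative only).
// For each element index, stores the block that would process it when
// `gridDim` blocks of one thread each are launched.
std::vector<int> ownerBlock(int n, int gridDim) {
    assert(gridDim > 0);
    std::vector<int> owner(n, -1);
    for (int block = 0; block < gridDim; ++block) {       // plays the role of blockIdx.x
        for (int tid = block; tid < n; tid += gridDim) {  // same stride as the kernel
            owner[tid] = block;
        }
    }
    return owner;
}
```

With n = 1024 and gridDim = 128, block 0 handles elements 0, 128, 256 and so on, so each block processes 8 elements.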

I then tried to transfer it to 3D, but I cannot get the indexing right. Say I have the following (three arrays here, just to test indexing):

int size = 4;
int[] dev_delta = gpu.Allocate<int>(size * size * size);
int[] dev_space = gpu.Allocate<int>(size * size * size);
int[] dev_result = gpu.Allocate<int>(size * size * size);

gpu.Launch(new dim3(4, 4, 4), 1, "testIndex", dev_delta, dev_space, dev_result);

And the kernel:

[Cudafy]
public static void testIndex(GThread thread, int[] delta, int[] space, int[] result)
{
    int x = thread.blockIdx.x;
    int y = thread.blockIdx.y;
    int z = thread.blockIdx.z;
    delta[x]++;
    space[y]++;
    result[z]++;
}

Naively I'd expect the following:

delta = {4,4,4,4,0,0,0,0,0, ... 0,0}
space = {4,4,4,4,0,0,0,0,0, ... 0,0}
result = {4,4,4,4,0,0,0,0,0 ... 0,0}

But I get:

delta = {1,1,1,1,0,0,0,0,0, ... 0,0}
space = {1,1,1,1,0,0,0,0,0, ... 0,0}
result = {1,0,0,0,0,0,0,0,0 ... 0,0}

This makes no sense to me; clearly I am missing something.

Questions:

How many threads am I starting?

How do you go about 'indexing' my example problem in 3 dimensions (Starting 4x4x4 threads and getting the variables for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])?

How do you go about 'indexing' my example problem in 2 dimensions? (Starting 4x4 threads and then letting each thread handle a depth column of length 4)

I found this, which may be relevant: "Why is z always zero in CUDA kernel". Even if that is what is messing me up, I'd still appreciate pure-CUDA answers to sort my brain out.

Upvotes: 1

Views: 323

Answers (1)

Florent DUGUET

Reputation: 2916

How many threads am I starting? You are starting 1 thread per block, hence 16 in total, since the Z parameter is not used. For better performance, I would also recommend using more threads per block (at least 128, and a multiple of 32 in any case).
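
As a rough illustration of how 16 blocks could produce the arrays shown in the question, here is a CPU-only model in plain C++. It rests on two assumptions that are not confirmed by the source: the Z grid dimension collapses to 1, and the unsynchronized `++` loses updates because every block reads the initial 0 before any block writes back. `Counters` and `simulate` are hypothetical names.

```cpp
#include <array>
#include <cassert>

struct Counters {
    std::array<int, 4> delta{}, space{}, result{};
};

// Model of the testIndex kernel run by gridX * gridY * gridZ one-thread blocks.
// lostUpdates = true assumes every block reads 0 and stores 0 + 1 (a data race);
// lostUpdates = false assumes increments are serialized, as an atomic add would do.
Counters simulate(int gridX, int gridY, int gridZ, bool lostUpdates) {
    assert(gridX > 0 && gridX <= 4 && gridY > 0 && gridY <= 4 && gridZ > 0 && gridZ <= 4);
    Counters c;
    for (int z = 0; z < gridZ; ++z)
        for (int y = 0; y < gridY; ++y)
            for (int x = 0; x < gridX; ++x) {
                if (lostUpdates) {
                    c.delta[x] = 1; c.space[y] = 1; c.result[z] = 1;
                } else {
                    ++c.delta[x]; ++c.space[y]; ++c.result[z];
                }
            }
    return c;
}
```

Under this model, `simulate(4, 4, 1, true)` gives delta = {1,1,1,1}, space = {1,1,1,1} and result = {1,0,0,0}, which matches the observed output. With serialized increments and a full 4x4x4 grid, each slot would reach 16 rather than 4, because each x value is shared by 16 blocks.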

How do you go about 'indexing' my example problem in 3 dimensions (Starting 4x4x4 threads and getting the variables for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])? The second parameter of the gpu.Launch method is the number of threads per block. x, y and z could hence be threadIdx.x, threadIdx.y and threadIdx.z respectively. But you may also want to use many blocks, in which case threadIdx.x + blockDim.x * blockIdx.x would be a good pick.
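
The two formulas above can be sketched as plain C++ helpers, runnable on the CPU. `Dim3`, `globalIndex` and `flatten` are illustrative names, not part of CUDA or CUDAfy.NET; inside a real kernel the arguments would come from threadIdx, blockDim and blockIdx.

```cpp
#include <cassert>

struct Dim3 { int x, y, z; };  // stand-in for a 3-component index or extent

// Global coordinate along one axis: threadIdx + blockDim * blockIdx.
int globalIndex(int threadIdx, int blockDim, int blockIdx) {
    return threadIdx + blockDim * blockIdx;
}

// Flattening from the question: x * sizeY * sizeZ + y * sizeZ + z.
int flatten(Dim3 p, Dim3 size) {
    assert(p.x >= 0 && p.x < size.x);
    assert(p.y >= 0 && p.y < size.y);
    assert(p.z >= 0 && p.z < size.z);
    return p.x * size.y * size.z + p.y * size.z + p.z;
}
```

For example, thread 1 of block 1 with 2 threads per block has global coordinate 3 on that axis, and the point (1, 2, 3) in a 4x4x4 volume maps to flat index 27.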

The link you provided explains why your Z dimension is not relevant. CUDAfy.NET exposes a launch function that in turn calls the CUDA runtime C API. When passing parameters from .NET to the native environment, it seems that CUDAfy.NET simply ignores the Z argument, leaving it at one (most probably because early versions of CUDA did not support a Z grid dimension other than one). The explanation is not pure CUDA, because CUDA now supports Z values other than one; your parameter is simply ignored in the CUDAfy.NET implementation.
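
If Z is indeed ignored, the 2D variant from the question (4x4 threads, each walking a depth column) sidesteps the problem. Below is a CPU-only sketch in plain C++ of that layout applied to the rule from the question. The bounds checks on x + 1 and x - 1 are my assumption, since the question does not say what should happen at the edges, and `applyRule` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a size x size launch in which each (x, y) "thread" loops over z.
// Rule from the question: input == 1 bumps output[x+1, y, z],
// input == 2 bumps output[x-1, y, z]. Out-of-range neighbours are skipped
// (an assumption, not stated in the question).
std::vector<int> applyRule(const std::vector<int>& input, int size) {
    assert(input.size() == static_cast<std::size_t>(size) * size * size);
    std::vector<int> output(input.size(), 0);
    auto flat = [size](int x, int y, int z) { return x * size * size + y * size + z; };
    for (int x = 0; x < size; ++x)
        for (int y = 0; y < size; ++y)          // one (x, y) pair per emulated thread
            for (int z = 0; z < size; ++z) {    // depth loop inside the thread
                int v = input[flat(x, y, z)];
                if (v == 1 && x + 1 < size) ++output[flat(x + 1, y, z)];
                if (v == 2 && x - 1 >= 0)   ++output[flat(x - 1, y, z)];
            }
    return output;
}
```

In a real kernel, two different threads can target the same output cell (one from x - 1, one from x + 1), so the increments would probably need an atomic add rather than a plain `++`.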

Upvotes: 2
