andregalera
andregalera

Reputation: 11

Cupy indexing in 2D Cuda Grid Kernels?

I'm trying to start using Cupy for some Cuda Programming. I need to write my own kernels. However, I'm struggling with 2D kernels. It seems that Cupy does not work the way I expected. Here is a very simple example of a 2D kernel in Numba Cuda:

import cupy as cp
from numba import cuda

@cuda.jit
def nb_add_arrs(x1, x2, y):
  i, j = cuda.grid(2)
  if i < y.shape[0] and j < y.shape[1]:
    y[i, j] = x1[i, j] + x2[i, j]

x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[0] + 1)
# Call kernel
nb_add_arrs[bpg, tpb](x1, x2, y)

The result is, as expected:

y
[[2 2 2 2 2]
 [2 2 2 2 2]
 [2 2 2 2 2]
 [2 2 2 2 2]
 [2 2 2 2 2]]

However, when I try to do this simple kernel in Cupy, I don't get the same.

cp_add_arrs = cp.RawKernel(r'''
extern "C" __global__
void add_arrs(const float* x1, const float* x2, float* y, int N){
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  int j = blockDim.y * blockIdx.y + threadIdx.y;

  if(i < N && j < N){
    y[i, j] = x1[i, j] + x2[i, j];
  }
}
''', 'add_arrs')

x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
N = x1.shape[0]
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[0] + 1)
# Call kernel
cp_add_arrs(bpg, tpb, (x1, x2, y, cp.int32(N)))

y
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]

Can someone help me figure out why?

Upvotes: 0

Views: 986

Answers (2)

Gediz G&#220;RSU
Gediz G&#220;RSU

Reputation: 636

What may be confusing is that there is only number of blocks and number of threads.

Grid dimensions (bx,by,bz) --> blocks[bx,by,bz]

Block dimensions (tx,ty,tz) ---> threads[tx,ty,tz]

one thread is for example blocks[a,b,c][d,e,f]

Which can be mapped to a counter K look at the example. Fiddle with it if you like :

matmul = cp.RawKernel(r'''
extern "C" __global__
void matmul_kernel(float *A, float *B, float *C) {
    int K = threadIdx.x
      +  blockDim.x  * threadIdx.y
      +  blockDim.x  *  blockDim.y  * threadIdx.z
      +  blockDim.x  *  blockDim.y  *  blockDim.z  * blockIdx.x
      +  blockDim.x  *  blockDim.y  *  blockDim.z  *  gridDim.x  * blockIdx.y
      +  blockDim.x  *  blockDim.y  *  blockDim.z  *  gridDim.x  *  gridDim.y  * blockIdx.z ;

    // Actually there is blockDim and threadDim  all xyz there can only one iterator with this. 
    

      if (K<10000) { 
        if (K%9==0) {C[K]=gridDim.x;}
        if (K%9==1) {C[K]=gridDim.y;}
        if (K%9==2) {C[K]=gridDim.z;}
        if (K%9==3) {C[K]=blockDim.x;}
        if (K%9==4) {C[K]=blockDim.y;}
        if (K%9==5) {C[K]=blockDim.z;}
        if (K%9==6) {C[K]=threadIdx.x;}
        if (K%9==7) {C[K]=threadIdx.y;}
        if (K%9==8) {C[K]=threadIdx.z;}
      C[K]=K; 
      }

}
''', 'matmul_kernel')
x1 = cp.random.random(10000, dtype=cp.float32)
x2 = cp.random.random(10000, dtype=cp.float32)
y = cp.zeros((10000), dtype=cp.float32)
matmul((32,32,32,), (10,10,10,), (x1, x2, y)) # grid, block and arguments  # first of all blocks are huge but threads.xyz are limited to total 1024 10,10,10
max(y)

Upvotes: 0

andregalera
andregalera

Reputation: 11

Memory in C is stored in a row-major-order. So, we need to index following this order. Also, since I'm passing int arrays, I changed the argument types of my kernel. Here is the code:

cp_add_arrs = cp.RawKernel(r'''
extern "C" __global__
void add_arrs(int* x1, int* x2, int* y, int N){
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  int j = blockDim.y * blockIdx.y + threadIdx.y;
  
  if(i < N && j < N){
    y[j + i*N] = x1[j + i*N] + x2[j + i*N];
  }
}
''', 'add_arrs')

x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
N = x1.shape[0]
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[0] + 1)
# Call kernel
cp_add_arrs(bpg, tpb, (x1, x2, y, cp.int32(N)))

y
array([[2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2]], dtype=int32)

Upvotes: 1

Related Questions