Ilya R.

Reputation: 25

Slow CUDA kernel

In a nutshell, I wrote a CUDA kernel in Python (Numba) in an attempt to optimize some code, and it runs a lot slower on the GPU than the equivalent code on the CPU. Specifically, the CPU code takes ~0.007 s (averaged over a few hundred runs), while on the GPU the best I got was 0.03 s if I do the memory management with cuda.to_device / copy_to_host (the timing includes these operations), and 0.06 s if I do not. I have read quite a bit about CUDA, but I am stuck and out of ideas.

The code does the following. There are two 4D arrays of np.complex64 values, A[x,y,z,d] and B[x,y,z,d]. d is always fixed; x, y, and z vary depending on the user input (a 3D spatial grid). In the first step, every element of B is assigned an element of A, permuted according to a set of rules and modified (e.g., multiplied by some weights, with values added or subtracted, etc.). The rules and the weights are encoded in a set of arrays that are passed to the kernel together with A and B. In the second step, the values of B are written back to A, again permuted and modified according to another set of rules.
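
Schematically, step 1 does something like this (perm, w, and offs below are made-up stand-ins for the actual rule and weight arrays, which are more complicated):

import numpy as np

# Hypothetical sketch of step 1 (perm, w, offs are placeholders for the real rule/weight arrays)
nx, ny, nz, nd = 4, 3, 4, 19
A = np.zeros((nx, ny, nz, nd), dtype=np.complex64)
B = np.empty_like(A)
perm = np.arange(nd)                     # some permutation of 0..nd-1
w    = np.ones(nd, dtype=np.complex64)   # some weights
offs = np.zeros(nd, dtype=np.complex64)  # some additive terms

for d in range(nd):
    B[:, :, :, d] = w[d] * A[:, :, :, perm[d]] + offs[d]

Step 2 writes B back into A in the same spirit, with a different permutation and different weights.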

The CPU code for step 1 contains four nested for-loops over x, y, z, and d. For step 2, there are again nested for-loops over x, y, and z, but then there are several sequential loops over d, because some intermediate values are calculated for each B[x,y,z] over d.

The CUDA kernel has a

x, y, z = cuda.grid(3)

and a loop over d for the first step, and then the three loops over d, one after the other, for the second step, with cuda.syncthreads() in-between to make sure the values are synchronized. The CUDA kernel and the CPU code give identical results.
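
In simplified form, the structure is something like this (this is not my real kernel; the rule arrays and the actual arithmetic are stand-ins):

from numba import cuda

@cuda.jit(fastmath=True)
def TwoStepKernel(a, b, rule1, rule2):   # simplified stand-in for the real kernel
    x, y, z = cuda.grid(3)
    nd = a.shape[3]
    inside = x < a.shape[0] and y < a.shape[1] and z < a.shape[2]
    # step 1: permute/weight A into B
    if inside:
        for d in range(nd):
            b[x, y, z, d] = a[x, y, z, rule1[d]]
    cuda.syncthreads()
    # step 2: loops over d, one after the other, writing back into A
    if inside:
        for d in range(nd):
            a[x, y, z, d] = b[x, y, z, rule2[d]]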

The kernel is called hundreds of times from within the main loop of the program, in a minimization problem. Threads per block and blocks per grid are defined like this:

threadsperblock = (16, 2, 16)
blockspergrid_x = int(math.ceil(A.shape[0] / threadsperblock[0]))+1
blockspergrid_y = int(math.ceil(A.shape[1] / threadsperblock[1]))+1
blockspergrid_z = int(math.ceil(A.shape[2] / threadsperblock[2]))+1
blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)

because y is always smaller than x or z, and x is always equal to z.
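
For example, for the 500 x 15 x 500, d = 19 arrays used in the sample code further down, this works out to:

import math

threadsperblock = (16, 2, 16)                  # 16*2*16 = 512 threads per block
blockspergrid   = (math.ceil(500 / 16) + 1,    # 33
                   math.ceil(15 / 2) + 1,      # 9
                   math.ceil(500 / 16) + 1)    # 33  -> 33*9*33 = 9801 blocks

(The sample code below omits the +1, which gives 32 x 8 x 32 blocks instead.)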

What I have done regarding optimization:

B is only temporary; I only need it inside the kernel, so I made it a cuda.device_array, defined outside the main loop from which the kernel is called, to save time on memory transfers. Similarly, every other array that is not modified inside the main loop is passed to the kernel via a cuda.to_device call placed outside the loop. Inside the main loop, there are cuda.to_device calls for the arrays passed to the kernel, and copy_to_host calls only for those arrays that I need later on (like A).
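
In outline, the memory handling looks roughly like this (the kernel, shapes, and loop count below are placeholders, not my real code):

import math
import numpy as np
from numba import cuda

@cuda.jit
def MyKernel(a, b, rules):                # stand-in for the real two-step kernel
    x, y, z = cuda.grid(3)
    if x < a.shape[0] and y < a.shape[1] and z < a.shape[2]:
        for d in range(a.shape[3]):
            b[x, y, z, d] = a[x, y, z, d] * rules[d]
            a[x, y, z, d] = b[x, y, z, d]

A     = np.zeros((64, 8, 64, 19), dtype=np.complex64)
rules = np.ones(19, dtype=np.complex64)

B_dev     = cuda.device_array(A.shape, dtype=np.complex64)   # scratch array, never copied to the host
rules_dev = cuda.to_device(rules)                            # constant arrays, copied to the GPU once

threadsperblock = (16, 2, 16)
blockspergrid   = tuple(math.ceil(A.shape[i] / threadsperblock[i]) for i in range(3))

for it in range(10):                      # stand-in for the main minimization loop
    A_dev = cuda.to_device(A)             # per-iteration input
    MyKernel[blockspergrid, threadsperblock](A_dev, B_dev, rules_dev)
    A = A_dev.copy_to_host()              # only A is needed back on the host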

I can't create B inside the kernel because I do not know x, y, and z at compile time. In any case, it seems that memory access is the least of my problems, unless there is a hidden transfer somewhere that I am not seeing. How do I test for that?
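
The most direct check I can think of is to bracket just the kernel launch with cuda.synchronize(), something like the sketch below (reusing the placeholder names from the outline above), but I am not sure that catches every hidden transfer:

import time
from numba import cuda

cuda.synchronize()                 # make sure pending copies are finished
t0 = time.time()
MyKernel[blockspergrid, threadsperblock](A_dev, B_dev, rules_dev)
cuda.synchronize()                 # wait for the kernel itself to complete
print('kernel only:', time.time() - t0)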

What I have not done yet is separate the real and imaginary parts and process them separately. That is a lot of work, and I would like to be sure it is actually the bottleneck before I embark on it.
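
If I do go down that road, I assume the split itself would look something like this on the host side (just a sketch):

import numpy as np

A = np.zeros((4, 3, 4, 19), dtype=np.complex64)
A_re = np.ascontiguousarray(A.real)            # float32, same shape as A
A_im = np.ascontiguousarray(A.imag)            # float32, same shape as A
# ... process A_re and A_im separately in the kernel ...
A = (A_re + 1j * A_im).astype(np.complex64)    # recombine afterwards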

I am running this on an NVIDIA GeForce MX550 with Python 3.11.5 on Windows 11. The CPU code is compiled with the @jit(nopython=True) decorator, and the kernel with the @cuda.jit(fastmath=True) decorator.

Two questions: First, what would be the best way to figure out where the bottleneck is? Is it computation, hidden memory accesses (e.g., if A or B become too large), or something else?

Second, are there any other CUDA optimization tricks or suggestions? Like I said, I am out of ideas.

Edit: Here is a short sample program that illustrates the problem.

import numpy as np
import math
from numba import jit, cuda
import time 

# =============================================================================
# CPU CODE
# =============================================================================

@jit(nopython = True)
def CPUTest(a, b, l1):
    # permutation table, hard-coded (same values as the l that is passed in as l1)
    l = np.array([0,2,1,4,3,6,5,8,7,10,9,12,11,14,13,16,15,18,17], dtype = np.int32)

    nx = a.shape[0]
    ny = a.shape[1]
    nz = a.shape[2]
    nd = a.shape[3] 
    for x in range(nx):
        for y in range (ny):
            for z in range(nz):
                for i in range(nd):
                    a[x,y,z,i] = complex(x, y)
        
                for i in range(nd):
                    b[x,y,z,i] = a[x,y,z,i]*(l[i]-l1[i])
    return b                    

# =============================================================================
# CUDA Kernel
# =============================================================================


@cuda.jit(fastmath=True)
def CUDATest(a, b, l1):
    # one thread per (x, y, z) point; the loops over d run inside each thread
    x, y, z = cuda.grid(3)
    nx = a.shape[0]
    ny = a.shape[1]
    nz = a.shape[2]
    nd = a.shape[3] 
    
    # permutation table in shared memory, filled element by element (redundantly by every thread)
    l =  cuda.shared.array(19, dtype = np.int32)
    l[0] =  0; l[1] =  2; l[2] =  1; l[3] =  4; l[4] =  3; l[5] =  6; l[6] =  5; l[7] =  8; l[8] =  7; l[9] =  10; l[10] = 9; l[11] = 12; l[12] = 11; l[13] = 14; l[14] = 13; l[15] = 16; l[16] = 15; l[17] = 18; l[18] = 17;
   
    if (x < nx) and (y < ny) and (z < nz):
        for i in range(nd):
            a[x,y,z,i] = complex(x, y)
        cuda.syncthreads()
        for i in range(nd):
            b[x,y,z,i] = a[x,y,z,i]*(l[i]-l1[i])
                    

k = 500

# =============================================================================
# CUDA
# =============================================================================
testarray = np.zeros((k,15,k,19),dtype = np.complex64)      # plays the role of A (500 x 15 x 500 grid, d = 19)
resultarray = np.zeros((k,15,k,19),dtype = np.complex64)    # plays the role of B
l =  np.array([0,2,1,4,3,6,5,8,7,10,9,12,11,14,13,16,15,18,17], dtype = np.int32)

threadsperblock = (16, 2, 16)
blockspergrid_x = int(math.ceil(testarray.shape[0] / threadsperblock[0]))
blockspergrid_y = int(math.ceil(testarray.shape[1] / threadsperblock[1]))
blockspergrid_z = int(math.ceil(testarray.shape[2] / threadsperblock[2]))
blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)

stream = cuda.stream()    

# timed section: host->device copies, one kernel launch, device->host copy
time1 = time.time()
testarray_cuda = cuda.to_device(testarray, stream = stream)
resultarray_cuda = cuda.to_device(resultarray, stream = stream)
l_cuda = cuda.to_device(l, stream = stream)

CUDATest[blockspergrid, threadsperblock, stream](testarray_cuda, resultarray_cuda, l_cuda)

resultarray = resultarray_cuda.copy_to_host(stream=stream) 
CUDAtime = time.time()-time1


print('Output:...................')
print(resultarray)

# =============================================================================
# CPU
# =============================================================================

testarray = np.zeros((k,15,k,19),dtype = np.complex64)
resultarray = np.zeros((k,15,k,19),dtype = np.complex64)

time1 = time.time()   # timed section: one call to the jitted CPU function

resultarray = CPUTest(testarray, resultarray, l)

CPUtime = time.time()-time1

print('Output:...................')
print(resultarray)


print('CUDA time:', CUDAtime)
print('CPU time:', CPUtime)

resultarray is full of zeros in both cases (as expected here: l1 is the same permutation as the internal l, so the factor l[i]-l1[i] is always zero).

CUDA time: 6.6932713985443115
CPU time: 0.49091577529907227

Upvotes: -3

Views: 56

Answers (0)
