disruptive

Reputation: 5946

Shared Memory Code with Python CUDA

I'm struggling to get some code running that uses the shared memory features for a fast matrix multiply, but every time I try it I run into errors that I cannot fathom.

import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

block_size = (m,n)
grid_size = (int(m/n),int(m/n))


@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)


    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]
    c[row][column] = sum

and testing it with

mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()

keeps resulting in the following error:

CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR

After chatting with the author of one answer, I've updated the function, but I still cannot make it work. To compute the sum for each element of the output c, we need to loop over the columns of A and the rows of B, using i as the index; we therefore have n*n products. I think the i is correct in the sum, but I cannot seem to get the correct row and column indices for a and b in the expression for the sum.

import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)


    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]

    cuda.syncthreads()


    for i in range(a.shape[1]):

        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]

    c[row][column] = sum

Upvotes: 1

Views: 1781

Answers (3)

fccoelho

Reputation: 6204

This is the correct solution:

import numpy as np
from numba import cuda, types

# Assumes `block_size` (the shared tile shape) and `N` (the tile width,
# i.e. threads per block in each dimension) are defined at module level
# before the kernel is first launched.
@cuda.jit
def mm_shared(a, b, c):
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    # Each thread populates one element of a_cache and one of b_cache
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    TPB = int(N)

    # Slide the shared tiles along the inner dimension, accumulating the
    # partial dot product contributed by each tile
    for i in range(a.shape[1] // TPB):
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):
            sum += a_cache[tx, j] * b_cache[j, ty]
        cuda.syncthreads()
    c[x, y] = sum
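
For completeness, here is a minimal host-side sketch (my addition, not part of the original answer) showing the globals this kernel relies on. It assumes square tiles of width N and matrix dimensions that are multiples of N, using the sizes from the question:

import numpy as np
from numba import cuda

m, n = 128, 32
N = 32                        # 32 x 32 = 1024 threads, the per-block maximum
block_size = (N, N)           # must exist before the kernel is first launched

a = np.arange(m * n).reshape(m, n).astype(np.int32)
b = np.arange(m * n).reshape(n, m).astype(np.int32)
c = np.zeros((m, m), dtype=np.int32)    # note: m x m, not m x n

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

grid_size = (m // N, m // N)
mm_shared[grid_size, block_size](d_a, d_b, d_c)
assert np.array_equal(d_c.copy_to_host(), a @ b)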

Upvotes: 0

ngenne

Reputation: 106

@disruptive: Hi, did you find a solution to your problem? I had the same problem as you, but I solved it by restarting the Jupyter notebook kernel.

My code is slightly different from yours:

from numba import cuda, types

# Assumes `block_size` (the tile shape) and `N` (the tile width) are
# defined at module level before the kernel is compiled.
@cuda.jit
def mm_shared(a, b, c):
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    for i in range(a.shape[1] // N):
        # Each thread loads one element of the current tile of a and b
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]

        # Wait until all threads have finished loading
        cuda.syncthreads()

        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]

        # Wait until all threads finish computing before the next load
        cuda.syncthreads()

    c[row, col] = sum

Please let me know if you have any update.

Upvotes: 0

talonmies

Reputation: 72348

Your block size is invalid. CUDA devices have a limit of 1024 threads per block, and your (m, n) = (128, 32) block asks for 4096. When I run your code I see this:

/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
    327                     _logger.critical(msg, _getpid(), self.pid)
    328                     raise CudaDriverError("CUDA initialized before forking")
--> 329             raise CudaAPIError(retcode, msg)
    330 
    331     def get_device(self, devnum=0):

CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
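
(As an aside, and not part of the diagnosis itself: you can confirm the per-block limit for your own device through numba's device attribute API.)

from numba import cuda

# Print the hardware limit that the launch above exceeded
print(cuda.get_current_device().MAX_THREADS_PER_BLOCK)   # typically 1024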

When I fix that I see this:

$ cuda-memcheck python somethingsometing.py

========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
=========     at 0x000008b0 in cudapy::__main__::mm$241(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
=========     by thread (15,11,0) in block (3,2,0)
=========     Address 0x00000ec0 is out of bounds

The why is pretty obvious:

for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]

row and column are dimensions in the execution grid, not the local shared memory tile, and similarly i is bounded by the shape of a, not the shape of a_cache. (Note also that you seem to have lapsed into C-style 2D array indexing syntax about halfway through the code, which is a potential bug if you don't understand the difference between the two in Python.)

To fix it you will have to change the indexing and then implement the rest of the multiplication code, i.e. you must iteratively load the whole row and column slices through the local shared tiles to compute the full dot product for each row/column pair that a block will process.
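
Something along these lines is what that means in practice. This is my sketch, not code from the post; the kernel name mm_tiled and the guard logic are my additions, and the guards assume the matrix dimensions need not be multiples of the tile width TPB:

import numpy as np
from numba import cuda, types

TPB = 32   # tile width: a 32 x 32 block is 1024 threads, the per-block maximum

@cuda.jit
def mm_tiled(a, b, c):
    a_cache = cuda.shared.array((TPB, TPB), types.int32)
    b_cache = cuda.shared.array((TPB, TPB), types.int32)

    x, y = cuda.grid(2)      # global row / column this thread computes
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    acc = 0
    # Slide the shared tiles along the inner dimension of a and b
    for i in range((a.shape[1] + TPB - 1) // TPB):
        # Guarded loads: threads that fall outside the matrices load zeros
        if x < a.shape[0] and ty + i * TPB < a.shape[1]:
            a_cache[tx, ty] = a[x, ty + i * TPB]
        else:
            a_cache[tx, ty] = 0
        if tx + i * TPB < b.shape[0] and y < b.shape[1]:
            b_cache[tx, ty] = b[tx + i * TPB, y]
        else:
            b_cache[tx, ty] = 0
        cuda.syncthreads()   # the whole tile must be loaded before reading

        for j in range(TPB):
            acc += a_cache[tx, j] * b_cache[j, ty]
        cuda.syncthreads()   # finish reading before the tile is overwritten

    if x < c.shape[0] and y < c.shape[1]:
        c[x, y] = acc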

Note also that

  • The dimensions you have selected for c are wrong: for a (m x n) times b (n x m), the output should be m x m.
  • The grid size you launch the kernel on is also wrong, because it is derived from the wrong dimensions of c, so your code could never calculate the whole output matrix.
  • Even after fixing all of this, the results of the multiplication are likely to be incorrect at anything other than trivial sizes because of integer overflow (see the quick check below).
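
On the last point, a quick back-of-the-envelope check (my illustration): each output element is a sum of n products of values as large as m*n - 1, so the accumulator outgrows int32 quickly.

import numpy as np

m = n = 1024
worst_case = (m * n - 1) ** 2 * n           # largest possible dot product
print(worst_case > np.iinfo(np.int32).max)  # True: int32 would overflow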

Upvotes: 2
