How to use WMMA functions in Cupy kernels?

Question

How to use WMMA functions such as wmma::load_matrix_sync in cupy.RawKernel or cupy.RawModule? can someone provide a minimal example?

Robert Crovella · Accepted Answer

We can combine information on cupy RawKernel and wmma programming to provide most of the needed material. I don't intend to give a tutorial on wmma programming, there are other resources for that such as this blog and the cutlass template library.

Note that the wmma functions require compute capability 7.0 or higher. You must run on a Volta, Turing, or Ampere GPU.

Let's take the kernel example given in the programming guide. To put this in a RawKernel, we need to provide it as a string. In order to support the use of the kernel C-style, I have broken the kernel code into a __device__ function that can use C++, while exporting the kernel entry point (wmma_ker) using C-style linkage. The example code performs a 16x16 matrix multiply (using a single warp). Here is a worked example:

# cat t24.py
import numpy
import cupy as cp
ddim = 16
bdim = 32
gdim = 1
a = cp.ones(ddim*ddim, dtype=cp.float16)
b = cp.ones(ddim*ddim, dtype=cp.float16)
c = cp.zeros(ddim*ddim, dtype=cp.float32)
wmma_ker = cp.RawKernel(r'''
  #include 
  using namespace nvcuda;
  __device__ void wmma_ker_dev(half *a, half *b, float *c) {
  // Declare the fragments
    wmma::fragment a_frag;
    wmma::fragment b_frag;
    wmma::fragment c_frag;

  // Initialize the output to zero
    wmma::fill_fragment(c_frag, 0.0f);

  // Load the inputs
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

  // Perform the matrix multiplication
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

  // Store the output
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
  }
  extern "C" {
    __global__ void wmma_ker(half *a, half *b, float *c) {
          wmma_ker_dev(a,b,c);
    }
  }
 ''', 'wmma_ker', options=("-restrict","-lineinfo"))
wmma_ker((gdim,1), (bdim,1), (a,b,c))  # grid, block and arguments
r_o = cp.asnumpy(c)
print(r_o)
# cuda-memcheck python t24.py
========= CUDA-MEMCHECK
[16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
 16. 16. 16. 16.]
========= ERROR SUMMARY: 0 errors
#

I used pip install cupy-cuda102 to set up cupy for this, otherwise running on a machine with CUDA 10.2 installed, and a Tesla V100 GPU. The RawKernel options I have provided are unnecessary for this demonstration, you could omit that argument entirely.

The purpose of this code is to demonstrate an example method. I'm not suggesting the code is defect free or suitable for any particular purpose. Use it at your own risk. In particular, I would not expect this code to work correctly if any aspect of it is changed. I am not suggesting that it is a general/flexible/extensible matrix multiply routine.

How to use WMMA functions in Cupy kernels?

Answers (1)

Related Questions