Reputation: 645
I'm profiling some code and can't figure out a performance discrepancy. I'm trying to do a simple element-wise addition between two arrays (in-place). This is the CUDA kernel using numba:
from numba import cuda
@cuda.jit('void(float32[:], float32[:])')
def cuda_add(x, y):
    # Grid-stride loop: each thread handles every stepSize-th element
    ix = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stepSize = cuda.gridDim.x * cuda.blockDim.x
    while ix < y.shape[0]:
        y[ix] += x[ix]
        ix += stepSize
I thought the performance was fine, but then I compared it to the cuBLAS method:
from accelerate.cuda.blas import Blas
blas = Blas()
blas.axpy(1.0, X, Y)
The BLAS method is roughly 25% faster for large arrays (20M elements). This is after "warming up" the cuda.jit kernel by calling it beforehand, so the compiled PTX code is already cached (not sure if this matters, but I did it just to make sure that wasn't the issue).
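For reference, a minimal sketch of the warm-up-then-measure pattern I mean, using the arrays from the execution code below (not my exact harness; the key detail is that kernel launches are asynchronous, so cuda.synchronize() has to run before the clock stops):

    import time
    from numba import cuda

    # Warm-up call so JIT/PTX compilation isn't included in the measurement
    cuda_add[1024, 64](d_x, d_y)
    cuda.synchronize()

    t0 = time.perf_counter()
    cuda_add[1024, 64](d_x, d_y)
    cuda.synchronize()  # wait for the kernel to actually finish
    t1 = time.perf_counter()
    print('cuda_add: %.3f ms' % ((t1 - t0) * 1e3))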
I could understand this performance difference for level-3 matrix-matrix operations, but this is a simple addition. Is there something I can do to squeeze more performance out of the cuda.jit code? I'm asking because the real code I want to optimize is a 2D array, which can't be passed to blas.axpy.
EDIT: Execution code and other needed packages:
import numpy as np

def main():
    n = 20 * 128 * 128 * 64
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)

    ## Create necessary GPU arrays
    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)

    ## My function
    cuda_add[1024, 64](d_x, d_y)

    ## cuBLAS function
    blas = Blas()
    blas.axpy(1.0, d_x, d_y)
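A quick sanity check (a sketch using numpy's testing helpers) that both in-place additions did what was expected; since each call adds x into d_y, after both of them d_y should hold y + 2*x:

    ## Copy the result back and compare against numpy
    result = d_y.copy_to_host()
    np.testing.assert_allclose(result, y + 2.0 * x, rtol=1e-5)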
Upvotes: 5
Views: 1742
Reputation: 72349
The very short answer is no. CUBLAS leverages a number of things (textures, vector types) to improve the performance of memory-bound code like this, which the numba CUDA dialect doesn't presently support.
I dashed this off in CUDA:
__device__ float4 add(float4 x, float4 y)
{
    x.x += y.x; x.y += y.y; x.z += y.z; x.w += y.w;
    return x;
}

__global__ void mykern(float* x, float* y, int N)
{
    // Reinterpret the inputs as float4 so each load/store moves 16 bytes
    float4* x4 = reinterpret_cast<float4*>(x);
    float4* y4 = reinterpret_cast<float4*>(y);

    int stride = gridDim.x * blockDim.x;
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    for(; tid < N/4; tid += stride) {
        float4 valx = x4[tid];
        float4 valy = y4[tid];
        y4[tid] = add(valx, valy);
    }
}
and my benchmarking shows it to be within about 5% of CUBLAS, but I don't believe you can do that in numba at the moment.
As an aside, I don't understand your remark about not being able to run saxpy on a 2D array. If your arrays are contiguous in memory (as I suspect they must be) and have the same layout (i.e. you are not trying to add a transpose), then you can use saxpy on a 2D array.
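Something like this sketch should work (it assumes numba device arrays, whose ravel() returns a flat view of a contiguous array rather than a copy):

    import numpy as np
    from numba import cuda
    from accelerate.cuda.blas import Blas

    ## Two C-contiguous 2D arrays with the same layout
    a = np.random.rand(4096, 4096).astype(np.float32)
    b = np.random.rand(4096, 4096).astype(np.float32)
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)

    ## View both 2D arrays as flat vectors; axpy then computes
    ## d_b = 1.0 * d_a + d_b element-wise, exactly as in the 1D case
    blas = Blas()
    blas.axpy(1.0, d_a.ravel(), d_b.ravel())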
Upvotes: 5