Reputation: 1740
I have two kernels for doing a matrix multiplication: one uses global memory and the other uses constant memory. I used the CUDA profiler to compare the speed of both kernels.
I tested both on a compute capability 1.3 device and on a 2.0 device. I expected the constant memory kernel to be faster on the 1.3 device and the global memory kernel to be faster on the 2.0 device, because 2.0 devices cache global memory accesses, but I found that on both devices the global memory kernel is faster. Is this due to memory coalescing on global memory? If so, is there a way to make the constant kernel faster?
I'm using matrices of 80x80 and a block size of 16.
Here is the global memory kernel:
__global__ void MatMulGlobKernel(const Matriz A, const Matriz B, Matriz C) {
    // Each thread computes one element of C
    float Cvalor = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;  // guard against out-of-range threads
    for (int e = 0; e < A.width; ++e)
        Cvalor += A.valores[row * A.width + e] * B.valores[e * B.width + col];
    C.valores[row * C.width + col] = Cvalor;
}
A.valores, B.valores and C.valores reside in global memory.
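For reference, the Matriz definition and launch code are not shown above, so here is a sketch of how the setup might look, assuming a simple row-major width/height/pointer struct:

// Assumed layout of the Matriz struct (not shown in the question)
typedef struct {
    int width;
    int height;
    float* valores;  // row-major storage
} Matriz;

// Hypothetical launch for 80x80 matrices with 16x16 blocks (a 5x5 grid)
dim3 dimBlock(16, 16);
dim3 dimGrid((80 + dimBlock.x - 1) / dimBlock.x,
             (80 + dimBlock.y - 1) / dimBlock.y);
MatMulGlobKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);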
Now here is the constant memory kernel.
__global__ void MatMulConstKernel(const Matriz A, const Matriz B, Matriz C) {
    // Same structure, but A and B are read from __constant__ buffers
    float Cvalor = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;  // guard against out-of-range threads
    for (int e = 0; e < A.width; ++e)
        Cvalor += A_const_valores[row * A.width + e] * B_const_valores[e * B.width + col];
    C.valores[row * C.width + col] = Cvalor;
}
A_const_valores and B_const_valores reside in constant memory while C.valores resides in global memory.
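The constant declarations and host-side copies are not shown above; assuming 80x80 matrices and hypothetical host pointers h_A and h_B, they would look something like this:

// Assumed file-scope declarations; 80*80 floats fit well within the 64 KB constant memory limit
__constant__ float A_const_valores[80 * 80];
__constant__ float B_const_valores[80 * 80];

// Host side: copy the inputs into constant memory before launching the kernel
cudaMemcpyToSymbol(A_const_valores, h_A, 80 * 80 * sizeof(float));
cudaMemcpyToSymbol(B_const_valores, h_B, 80 * 80 * sizeof(float));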
This is the profiler result for the 1.3 device (Tesla M1060):

Const kernel   101.70 us
Global kernel   51.424 us

and for the 2.0 device (GTX 650):

Const kernel   178.05 us
Global kernel   58.144 us
Upvotes: 0
Views: 180
Reputation: 151899
Matrix multiplication usually has some loads in which adjacent threads access adjacent values in memory. Your kernels have a load that behaves this way:
B.valores[e * B.width + col];
When reading from global memory, this load can be serviced to the warp in a single transaction (from the L1 or L2 cache). Yes, this is a coalesced load.
The constant cache, on the other hand, can serve only one 32-bit word per cycle to a warp. When the 32 threads of a warp request 32 different locations, as they do for this load, the accesses are serialized, so the constant cache takes 32 cycles to deliver the same requested data.
This is not a typical use case for constant memory. Constant memory works best when every thread in the warp requests the same location, because the cache can then broadcast a single value to all 32 threads in one cycle.
As an experiment, you might see what kind of results you get if you keep the A matrix in __constant__ memory and the B matrix in global memory: with your 16x16 blocks, the A load (A.valores[row * A.width + e]) uses the same address for every thread in a half-warp, which is much closer to the broadcast pattern the constant cache is built for.
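A sketch of that experiment, assuming the Matriz struct and __constant__ declarations shown earlier (this kernel is an illustration, not code from the question):

// A comes from constant memory (one address per half-warp -> broadcast-friendly),
// B comes from global memory (adjacent addresses across the warp -> coalesced)
__global__ void MatMulMixedKernel(const Matriz A, const Matriz B, Matriz C) {
    float Cvalor = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= A.height || col >= B.width) return;
    for (int e = 0; e < A.width; ++e)
        Cvalor += A_const_valores[row * A.width + e] * B.valores[e * B.width + col];
    C.valores[row * C.width + col] = Cvalor;
}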
If you really want fast matrix multiply, however, use CUBLAS.
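For completeness, a minimal CUBLAS call for this case might look like the sketch below. cublasSgemm assumes column-major storage, so a row-major C = A*B can be obtained by swapping the operand order (row-major X is column-major X^T, and C^T = B^T * A^T); d_A, d_B, and d_C are hypothetical device pointers:

#include <cublas_v2.h>

const int N = 80;
const float alpha = 1.0f, beta = 0.0f;
cublasHandle_t handle;
cublasCreate(&handle);
// Computes the column-major product B*A, which is the row-major C = A*B
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,
            &alpha,
            d_B, N,   // B, leading dimension N
            d_A, N,   // A, leading dimension N
            &beta,
            d_C, N);  // C, leading dimension N
cublasDestroy(handle);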
Upvotes: 1