user3495341

Reputation: 133

Shared memory in CUDA

I have a matrix, u, of size NxN of complex data, and I'd like to multiply elementwise each row by a vector, k, of size 1xN. The data in u is stored row-wise.

I have two implementations of this. One which takes advantage of shared memory, dividing the matrix into tiles, and another which does not.

I am finding that the shared memory implementation, multiply1, is not faster; it is consistently just as fast as, or even slower than, multiply2.

The shared memory implementation is as follows,

__global__ void multiply1(cufftComplex *u, cufftComplex *k) {
     __shared__ cufftComplex k_s[BLOCK_WIDTH];
     int idx = blockDim.x*blockIdx.x + threadIdx.x;
     int idy = blockDim.y*blockIdx.y + threadIdx.y;
     int index;

     if (threadIdx.y == 0 && idx < N) {
         k_s[threadIdx.x] = k[idx];
     }
     __syncthreads();

     if (idx < N && idy < N) {
         index = N*idy + idx;
         u[index] = cuCmulf(k_s[threadIdx.x],u[index]);
     }

}

Whereas the global memory implementation is as follows,

__global__ void multiply2(cufftComplex *u, cufftComplex *k) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < N*N) {
        u[idx] = cuCmulf(k[idx % N], u[idx]);
    }
}

and the kernel launches in main, for a matrix of size 64x64:

dim3 block(16,16);
dim3 grid(4,4);
multiply1<<<grid, block>>>(d_u, d_k);
multiply2<<<16, 256>>>(d_u, d_k);

How can I use the profiler to find out why exactly multiply1 is not getting at least a slight speedup? Which metrics would elucidate what exactly is going on?

The profiler is telling me that for multiply1, I am getting 152 GB/s global memory load throughput, whereas for multiply2 I am getting 81 GB/s. This is logical, because I'm loading less from global memory. Shouldn't this translate into quicker execution?
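(For concreteness, with the legacy nvprof profiler these per-kernel comparisons can be requested directly; `./app` below is a placeholder for the actual executable, and newer CUDA toolkits use Nsight Compute instead:)

```shell
# Per-kernel memory metrics for both kernels in one run
# (nvprof metric names; ./app is a placeholder for your binary).
nvprof --metrics gld_throughput,gst_throughput,dram_read_throughput,shared_load_throughput,achieved_occupancy ./app

# Per-launch timing, to see which kernel is actually faster:
nvprof --print-gpu-trace ./app
```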

Upvotes: 0

Views: 132

Answers (1)

user2076694

Reputation: 846

It would be faster if you read each value many times, but here you read it only once. You transformed your problem into:

copy from global memory to shared memory
read from shared memory

instead of:

read from global memory

So yes, it is certainly slower than the previous algorithm, which uses only global memory. If you want to take advantage of shared memory, your algorithm has to read from it many times; otherwise you won't amortize the cost of the extra global-memory copy.

Upvotes: 1
