Reputation: 6623
I am implementing a simple geometric Brownian motion on the GPU. My code works well, i.e. it gives the correct values. My concern is with the speedup I'm getting; I was expecting a bit more. So far I have two implementations: one accessing only global memory, which gives about a 3x speedup, and a second using shared memory, which gives about a 2.3x speedup.
My question came after profiling the application with the NVIDIA Visual Profiler. According to it I have a Load/Store efficiency of 100%, but very low DRAM utilization (about 10%) and almost 50% Global Memory Replay because of non-coalesced accesses.
Once I saw that, I tried to use shared memory to avoid constant global memory accesses, but to my surprise the DRAM utilization got even lower (4.5%) and the Global Memory Replay only dropped to 46.3%.
I noticed that the occupancy of my kernel launch is low because I'm using almost all the available shared memory per block, but I don't know whether this explains the worse performance of the second approach.
Can you give me some advice on what could be happening in terms of performance, and where or what I should look at to try to improve it?
CUDA_IMPLEMENTATION.CU
#define BLOCK_SIZE 64
#define SHMEM_ROWS 7 //The same as c_numTimeSteps = numTimeSteps
#define SHMEM_COLS BLOCK_SIZE

__constant__ double c_c1;
__constant__ double c_c2;
__constant__ int c_numTimeSteps;
__constant__ int c_numPaths;
__constant__ double c_timeNodes[2000];

__global__
void kernelSharedMem(double *rv, double *pb)
{
    __shared__ double sh_rv[SHMEM_ROWS*SHMEM_COLS];
    __shared__ double sh_pb[(SHMEM_ROWS+1)*SHMEM_COLS];

    int p = blockDim.x * blockIdx.x + threadIdx.x;

    //The idea of this outer loop is to have tiles along the rows
    for(int tb = 0; tb < c_numTimeSteps; tb += SHMEM_ROWS)
    {
        //Copy values into shared memory
        for(int is = tb, isSh = 0;
            is < tb+SHMEM_ROWS && is < c_numTimeSteps;
            is++, isSh++)
        {
            sh_rv[isSh*SHMEM_COLS+threadIdx.x] =
                rv[is*c_numPaths+p];
        }
        sh_pb[threadIdx.x] = pb[tb*c_numPaths+p];
        __syncthreads();

        //Main computation in SHARED MEMORY
        for(int isSh = 0; isSh < SHMEM_ROWS; isSh++)
        {
            double dt = c_timeNodes[tb+isSh];
            double sdt = sqrt(dt) * c_c1;
            double mdt = c_c2 * dt;
            sh_pb[(isSh+1)*SHMEM_COLS+threadIdx.x] =
                sh_pb[isSh*SHMEM_COLS+threadIdx.x] *
                exp(mdt + sdt * sh_rv[isSh*SHMEM_COLS+threadIdx.x]);
        }
        __syncthreads();

        for(int is = tb, isSh = 0;
            is < tb+SHMEM_ROWS && is < c_numTimeSteps;
            is++, isSh++)
        {
            pb[(is+1)*c_numPaths+p] =
                sh_pb[(isSh+1)*SHMEM_COLS+threadIdx.x];
        }
    }
}
__global__
void kernelGlobalMem(double *rv, double *pb)
{
    int p = blockDim.x * blockIdx.x + threadIdx.x;
    for(int i = 0; i < c_numTimeSteps; i++)
    {
        double dt = c_timeNodes[i];
        double sdt = sqrt(dt) * c_c1;
        double mdt = c_c2 * dt;
        pb[(i+1)*c_numPaths+p] =
            pb[i*c_numPaths+p] *
            exp(mdt + sdt * rv[i*c_numPaths+p]);
    }
}
extern "C" void computePathGpu(vector<vector<double>>* rv,
vector<vector<double>>* pb,
int numTimeSteps, int numPaths,
vector<double> timeNodes,
double c1, double c2)
{
cudaMemcpyToSymbol(c_c1, &c1, sizeof(double));
cudaMemcpyToSymbol(c_c2, &c2, sizeof(double));
cudaMemcpyToSymbol(c_numTimeSteps, &numTimeSteps, sizeof(int));
cudaMemcpyToSymbol(c_numPaths, &numPaths, sizeof(int));
cudaMemcpyToSymbol(c_timeNodes, &(timeNodes[0]), sizeof(double)*numTimeSteps);
double *d_rv;
double *d_pb;
cudaMalloc((void**)&d_rv, sizeof(double)*numTimeSteps*numPaths);
cudaMalloc((void**)&d_pb, sizeof(double)*(numTimeSteps+1)*numPaths);
vector<vector<double>>::iterator itRV;
vector<vector<double>>::iterator itPB;
double *dst = d_rv;
for(itRV = rv->begin(); itRV != rv->end(); ++itRV)
{
double *src = &((*itRV)[0]);
size_t s = itRV->size();
cudaMemcpy(dst, src, sizeof(double)*s, cudaMemcpyHostToDevice);
dst += s;
}
cudaMemcpy(d_pb, &((*(pb->begin()))[0]),
sizeof(double)*(pb->begin())->size(), cudaMemcpyHostToDevice);
dim3 block(BLOCK_SIZE);
dim3 grid((numPaths+BLOCK_SIZE-1)/BLOCK_SIZE);
kernelGlobalMem<<<grid, block>>>(d_rv, d_pb);
//kernelSharedMem<<<grid, block>>>(d_rv, d_pb);
cudaDeviceSynchronize();
dst = d_pb;
for(itPB = ++(pb->begin()); itPB != pb->end(); ++itPB)
{
double *src = &((*itPB)[0]);
size_t s = itPB->size();
dst += s;
cudaMemcpy(src, dst, sizeof(double)*s, cudaMemcpyDeviceToHost);
}
cudaFree(d_pb);
cudaFree(d_rv);
}
MAIN.CPP
extern "C" void computeOnGPU(vector<vector<double>>* rv,
vector<vector<double>>* pb,
int numTimeSteps, int numPaths,
vector<double> timeNodes,
double c1, double c2);
int main(){
int numTimeSteps = 7;
int numPaths = 2000000;
vector<vector<double>> rv(numTimeSteps, vector<double>(numPaths));
//Fill rv
vector<double> timeNodes(numTimeSteps);
//Fill timeNodes
vector<vector<double>> pb(numTimeSteps, vector<double>(numPaths, 0));
computeOnGPU(&rv, &pb, numTimeSteps, numPaths, timeNodes, 0.2, 0.123);
}
Upvotes: 0
Views: 670
Reputation: 9781
After profiling your code on my Tesla M2090, I think we should rank the suggestions given in the other answers by their impact.
Try to reduce the memcopy time. 97% of the time is spent on memcopy, including H2D and D2H. Since you use pageable memory, the copy speed is about 2.5-3 GB/s. You can roughly double that by using pinned memory (see the sketch after these suggestions). Zero-copy and other memory optimization techniques can be applied to increase the copy speed further.
Move the sqrt() out of the kernel. You can do 7 sqrt() calls on the CPU instead of 7 x 2,000,000 on the GPU. However, since your kernel is tiny (3% of the total time of computePathGpu()), this won't have much effect.
Reduce the global memory accesses. In your code, you read rv once, read pb once and write pb once. However, only the first row of pb contains useful data before the kernel is invoked, so the reading of the whole of pb can be eliminated by using registers. The solution is provided in the code below.
About the non-coalesced memory accesses, you can find the discussion here. Your case belongs to "A Sequential but Misaligned Access Pattern". A solution using cudaMallocPitch() is described below and provided in the following code.
Note: you mentioned that you have low DRAM utilization (about 10%), but the profile on my device looks OK (55.8%). Maybe that's because my device is a little old (M2090, CC 2.0).
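As a minimal sketch of the pinned-memory point above (the flat buffers h_rv and h_pb are hypothetical, they are not part of the original code), the host-side staging could look like this:

// Sketch: stage the path data in pinned (page-locked) host memory so cudaMemcpy
// can use DMA at full PCIe bandwidth instead of going through pageable memory.
double *h_rv = NULL;
double *h_pb = NULL;
cudaMallocHost((void**)&h_rv, sizeof(double) * numTimeSteps * numPaths);
cudaMallocHost((void**)&h_pb, sizeof(double) * (numTimeSteps + 1) * numPaths);

// ... fill h_rv and the first row of h_pb on the host ...

cudaMemcpy(d_rv, h_rv, sizeof(double) * numTimeSteps * numPaths,
           cudaMemcpyHostToDevice);
// With pinned buffers, cudaMemcpyAsync on a stream can additionally overlap
// the copies with kernel execution.

cudaFreeHost(h_rv);
cudaFreeHost(h_pb);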
#include <vector>
using namespace std;

#define BLOCK_SIZE 64
#define BLOCK_SIZE_OPT 256

__constant__ double c_c1;
__constant__ double c_c2;
__constant__ int c_numTimeSteps;
__constant__ int c_numPaths;
__constant__ double c_timeNodes[2000];

__global__ void kernelGlobalMem(double *rv, double *pb)
{
    int p = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < c_numTimeSteps; i++)
    {
        double dt = c_timeNodes[i];
        double sdt = sqrt(dt) * c_c1;
        double mdt = c_c2 * dt;
        pb[(i + 1) * c_numPaths + p] =
            pb[i * c_numPaths + p] *
            exp(mdt + sdt * rv[i * c_numPaths + p]);
    }
}
__global__ void kernelGlobalMemOpt(double *rv, double *pb, const size_t ld_rv, const size_t ld_pb)
{
    int p = blockDim.x * blockIdx.x + threadIdx.x;
    double pb0 = pb[p];   // keep the running path value in a register
    for (int i = 0; i < c_numTimeSteps; i++)
    {
        // c_timeNodes has been overwritten with sqrt(dt) on the host
        // (see computePathGpuOpt), so dt here is actually sqrt(dt).
        double dt = c_timeNodes[i];
        double sdt = dt * c_c1;        // c1 * sqrt(dt)
        double mdt = c_c2 * dt * dt;   // c2 * dt
        pb0 *= exp(mdt + sdt * rv[i * ld_rv + p]);
        pb[(i + 1) * ld_pb + p] = pb0;
    }
}
void computePathGpu(vector<vector<double> >* rv,
                    vector<vector<double> >* pb,
                    int numTimeSteps, int numPaths,
                    vector<double> timeNodes,
                    double c1, double c2)
{
    cudaMemcpyToSymbol(c_c1, &c1, sizeof(double));
    cudaMemcpyToSymbol(c_c2, &c2, sizeof(double));
    cudaMemcpyToSymbol(c_numTimeSteps, &numTimeSteps, sizeof(int));
    cudaMemcpyToSymbol(c_numPaths, &numPaths, sizeof(int));
    cudaMemcpyToSymbol(c_timeNodes, &(timeNodes[0]), sizeof(double) * numTimeSteps);

    double *d_rv;
    double *d_pb;
    cudaMalloc((void**) &d_rv, sizeof(double) * numTimeSteps * numPaths);
    cudaMalloc((void**) &d_pb, sizeof(double) * (numTimeSteps + 1) * numPaths);

    vector<vector<double> >::iterator itRV;
    vector<vector<double> >::iterator itPB;

    double *dst = d_rv;
    for (itRV = rv->begin(); itRV != rv->end(); ++itRV)
    {
        double *src = &((*itRV)[0]);
        size_t s = itRV->size();
        cudaMemcpy(dst, src, sizeof(double) * s, cudaMemcpyHostToDevice);
        dst += s;
    }
    cudaMemcpy(d_pb, &((*(pb->begin()))[0]),
               sizeof(double) * (pb->begin())->size(), cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE);
    dim3 grid((numPaths + BLOCK_SIZE - 1) / BLOCK_SIZE);
    kernelGlobalMem<<<grid, block>>>(d_rv, d_pb);
    cudaDeviceSynchronize();

    dst = d_pb;
    for (itPB = ++(pb->begin()); itPB != pb->end(); ++itPB)
    {
        double *src = &((*itPB)[0]);
        size_t s = itPB->size();
        dst += s;
        cudaMemcpy(src, dst, sizeof(double) * s, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_pb);
    cudaFree(d_rv);
}
void computePathGpuOpt(vector<vector<double> >* rv,
                       vector<vector<double> >* pb,
                       int numTimeSteps, int numPaths,
                       vector<double> timeNodes,
                       double c1, double c2)
{
    // Precompute sqrt(dt) on the host so the kernel avoids a per-thread sqrt().
    for (int i = 0; i < timeNodes.size(); i++)
    {
        timeNodes[i] = sqrt(timeNodes[i]);
    }

    cudaMemcpyToSymbol(c_c1, &c1, sizeof(double));
    cudaMemcpyToSymbol(c_c2, &c2, sizeof(double));
    cudaMemcpyToSymbol(c_numTimeSteps, &numTimeSteps, sizeof(int));
    cudaMemcpyToSymbol(c_numPaths, &numPaths, sizeof(int));
    cudaMemcpyToSymbol(c_timeNodes, &(timeNodes[0]), sizeof(double) * numTimeSteps);

    double *d_rv;
    double *d_pb;
    size_t ld_rv, ld_pb;
    // cudaMallocPitch pads each row so that every row starts at an aligned address.
    cudaMallocPitch((void **) &d_rv, &ld_rv, sizeof(double) * numPaths, numTimeSteps);
    cudaMallocPitch((void **) &d_pb, &ld_pb, sizeof(double) * numPaths, numTimeSteps + 1);
    ld_rv /= sizeof(double);   // convert the pitch from bytes to elements
    ld_pb /= sizeof(double);
    // cudaMalloc((void**) &d_rv, sizeof(double) * numTimeSteps * numPaths);
    // cudaMalloc((void**) &d_pb, sizeof(double) * (numTimeSteps + 1) * numPaths);

    vector<vector<double> >::iterator itRV;
    vector<vector<double> >::iterator itPB;

    double *dst = d_rv;
    for (itRV = rv->begin(); itRV != rv->end(); ++itRV)
    {
        double *src = &((*itRV)[0]);
        size_t s = itRV->size();
        cudaMemcpy(dst, src, sizeof(double) * s, cudaMemcpyHostToDevice);
        dst += ld_rv;
    }
    cudaMemcpy(d_pb, &((*(pb->begin()))[0]),
               sizeof(double) * (pb->begin())->size(), cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE_OPT);
    dim3 grid((numPaths + BLOCK_SIZE_OPT - 1) / BLOCK_SIZE_OPT);
    kernelGlobalMemOpt<<<grid, block>>>(d_rv, d_pb, ld_rv, ld_pb);
    cudaDeviceSynchronize();

    dst = d_pb;
    for (itPB = ++(pb->begin()); itPB != pb->end(); ++itPB)
    {
        double *src = &((*itPB)[0]);
        size_t s = itPB->size();
        dst += ld_pb;
        cudaMemcpy(src, dst, sizeof(double) * s, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_pb);
    cudaFree(d_rv);
}
int main()
{
    int numTimeSteps = 7;
    int numPaths = 2000000;
    vector<vector<double> > rv(numTimeSteps, vector<double>(numPaths));
    vector<double> timeNodes(numTimeSteps);
    vector<vector<double> > pb(numTimeSteps, vector<double>(numPaths, 0));
    vector<vector<double> > pbOpt(numTimeSteps, vector<double>(numPaths, 0));
    computePathGpu(&rv, &pb, numTimeSteps, numPaths, timeNodes, 0.2, 0.123);
    computePathGpuOpt(&rv, &pbOpt, numTimeSteps, numPaths, timeNodes, 0.2, 0.123);
}
Each of your CUDA threads computes one path for all time steps. According to your GlobalMem code, you don't share any data between paths, so shared memory is not required.
The non-coalesced access issue detected by nvprof arises because your data pb and rv are not well aligned. pb and rv can be seen as matrices of size [time steps x #paths]. Since your #paths is not a multiple of the cache line size, starting from the second row, i.e. the second time step, all the global memory accesses are non-coalesced. If your CUDA device is old, this results in ~50% memory replay; newer devices don't suffer from this kind of non-coalesced access.
The solution is simple: add padding bytes to the end of each row so that every row starts at an aligned DRAM address. This is done automatically by cudaMallocPitch().
There is another issue. In your code, you read rv once, read pb once and write pb once. However, pb contains no useful data (apart from the first row) before the kernel is invoked, so the reading of pb can be eliminated by using a register, giving you an extra ~50% speedup on top of solving the non-coalesced access issue.
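A minimal sketch of just the padding idea, distilled from the full computePathGpuOpt() listing above (pitchBytes and ld_rv are illustrative names):

double *d_rv;
size_t pitchBytes;
// cudaMallocPitch pads each row of the [numTimeSteps x numPaths] matrix so
// that every row starts at an aligned address; the actual row length in
// bytes is returned in pitchBytes.
cudaMallocPitch((void**)&d_rv, &pitchBytes,
                sizeof(double) * numPaths,   // requested row width in bytes
                numTimeSteps);               // number of rows
size_t ld_rv = pitchBytes / sizeof(double);  // row stride in elements

// Inside the kernel, element (i, p) of the padded matrix is then addressed
// with the element stride:
//     double x = rv[i * ld_rv + p];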
Upvotes: 2
Reputation: 72348
As others have pointed out, the shared memory version doesn't change the global memory access patterns at all, and there is effectively no data reuse between threads in the kernel. So the coalescing problem isn't solved, and all you are effectively doing is adding shared memory accesses and several synchronization points as overhead.
But look at what the kernel is really doing for a minute. The kernel works in double precision, which is slow on consumer cards, and has a pretty reasonable number of operations within the computation loop, which is good. Without access to a compiler I would guess that about half of the floating-point compute is in the exp call and half in the sqrt call, so this probably shouldn't be a memory-bound kernel on a consumer GPU. But roughly half of the double precision operations are just each thread calculating the same sqrt(dt) value. That is an enormous waste of cycles. Why not have the kernel iterate in a "non-dimensionalised" sqrt(dt) domain instead? That means you precalculate the (up to) 2000 sqrt(dt) values on the host and store them in constant memory. The kernel loop could then be written as something like this:
double pb0 = pb[p];
for(int i = 0; i < c_numTimeSteps; i++)
{
    double sdt = c_stimeNodes[i]; // sqrt(dt)
    double mdt = c_c2 * sdt * sdt;
    sdt *= c_c1;
    double pb1 = pb0 * exp(mdt + sdt * rv[p]);
    p += c_numPaths;
    pb[p] = pb1;
    pb0 = pb1;
}
[ disclaimer: written on an iPad in the middle of Lapland at 5am. Use at your own risk ]
Doing this replaces a sqrt with a multiply, which is a huge reduction in operations. Note that I also took the liberty of simplifying the indexing calculations down to a single integer add per loop iteration. The compiler is pretty smart, but you can make its job as easy or as hard as you like. I suspect a loop like the one above will be significantly faster than what you have now.
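A minimal host-side sketch of the setup the loop above assumes (c_stimeNodes and the helper name uploadSqrtTimeNodes are illustrative, not from the original code): take the square roots once on the CPU and copy them into constant memory.

#include <cmath>
#include <vector>

__constant__ double c_stimeNodes[2000]; // holds sqrt(dt) for each time step

// Hypothetical helper: precompute sqrt(dt) on the host and upload it once,
// so the kernel never calls sqrt().
void uploadSqrtTimeNodes(const std::vector<double>& timeNodes)
{
    std::vector<double> stime(timeNodes.size());
    for (size_t i = 0; i < timeNodes.size(); ++i)
        stime[i] = std::sqrt(timeNodes[i]);   // at most ~2000 sqrt() calls, done once
    cudaMemcpyToSymbol(c_stimeNodes, stime.data(),
                       sizeof(double) * stime.size());
}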
Upvotes: 3
Reputation: 4411
In kernelGlobalMem you're doing 3 * c_numTimeSteps reads/writes of rv and pb.
In kernelSharedMem you're doing 3 * c_numTimeSteps + c_numTimeSteps / SHMEM_ROWS reads/writes of rv and pb.
kernelSharedMem is more complex and its memory access pattern looks similar.
kernelGlobalMem is definitely better than kernelSharedMem.
Upvotes: 0