xlog

Reputation: 113

CUDA: is there a faster way of writing to global memory?

I'm writing an n-body simulation, and basically the whole operation is:

-Prepare CUDA memory
 loop {
    -Copy data to CUDA
    -Launch kernel
    -Copy data to host
    -Operations using data (drawing etc.)
 }
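
In case it's relevant, here is roughly how that outline maps onto the CUDA API in my code (a simplified sketch; the array names, step count, and block size are illustrative, and error checking plus the drawing step are omitted):

    #include <cuda_runtime.h>

    //Host-side outline of the loop above (simplified sketch)
    void runSimulation(float *hostXpos, float *hostYpos, float *hostZpos,
                       float *hostXforces, float *hostYforces, float *hostZforces,
                       float *hostMasses, int particlesNumber, int steps) {
        size_t bytes = particlesNumber * sizeof(float);
        float *deviceXpos, *deviceYpos, *deviceZpos;
        float *deviceXforces, *deviceYforces, *deviceZforces, *deviceMasses;

        //Prepare CUDA memory
        cudaMalloc(&deviceXpos, bytes);    cudaMalloc(&deviceYpos, bytes);
        cudaMalloc(&deviceZpos, bytes);    cudaMalloc(&deviceXforces, bytes);
        cudaMalloc(&deviceYforces, bytes); cudaMalloc(&deviceZforces, bytes);
        cudaMalloc(&deviceMasses, bytes);

        int blockSize = 256;
        int gridSize = (particlesNumber + blockSize - 1) / blockSize;

        for (int step = 0; step < steps; ++step) {
            //Copy data to CUDA
            cudaMemcpy(deviceXpos, hostXpos, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(deviceYpos, hostYpos, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(deviceZpos, hostZpos, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(deviceXforces, hostXforces, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(deviceYforces, hostYforces, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(deviceZforces, hostZforces, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(deviceMasses, hostMasses, bytes, cudaMemcpyHostToDevice);

            //Launch kernel
            calculateForcesCuda<<<gridSize, blockSize>>>(deviceXpos, deviceYpos, deviceZpos,
                                                         deviceXforces, deviceYforces, deviceZforces,
                                                         deviceMasses, particlesNumber);

            //Copy data to host
            cudaMemcpy(hostXforces, deviceXforces, bytes, cudaMemcpyDeviceToHost);
            cudaMemcpy(hostYforces, deviceYforces, bytes, cudaMemcpyDeviceToHost);
            cudaMemcpy(hostZforces, deviceZforces, bytes, cudaMemcpyDeviceToHost);

            //Operations using data (drawing etc.) go here
        }

        cudaFree(deviceXpos);    cudaFree(deviceYpos);    cudaFree(deviceZpos);
        cudaFree(deviceXforces); cudaFree(deviceYforces); cudaFree(deviceZforces);
        cudaFree(deviceMasses);
    }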

I've noticed that almost 90% of time is spent writing data to global device memory in the kernel. Here is the kernel:

 __global__ void calculateForcesCuda(float *deviceXpos, float *deviceYpos, float *deviceZpos,
                                    float *deviceXforces, float *deviceYforces, float *deviceZforces,
                                    float *deviceMasses, int particlesNumber) {
     int tid = threadIdx.x + blockIdx.x * blockDim.x;
     if (tid < particlesNumber) {
         float particleXpos = deviceXpos[tid];
         float particleYpos = deviceYpos[tid];
         float particleZpos = deviceZpos[tid];
         float xForce = 0.0f;
         float yForce = 0.0f;
         float zForce = 0.0f;
         for (int index=0; index<particlesNumber; index++) {
             if (tid != index) {
                 float otherXpos = deviceXpos[index];
                 float otherYpos = deviceYpos[index];
                 float otherZpos = deviceZpos[index];
                 float mass = deviceMasses[index];
                 float distx = particleXpos - otherXpos;
                 float disty = particleYpos - otherYpos;
                 float distz = particleZpos - otherZpos;
                 float distance = sqrt((distx*distx + disty*disty + distz*distz) + 0.01f);
                 xForce += 10.0f * mass / distance * (otherXpos - particleXpos);
                 yForce += 10.0f * mass / distance * (otherYpos - particleYpos);
                 zForce += 10.0f * mass / distance * (otherZpos - particleZpos);
             }
         }
         deviceXforces[tid] += xForce;
         deviceYforces[tid] += yForce;      
         deviceZforces[tid] += zForce;
     }
 }

The device running this is a GTX 970. The kernel takes around 8.0 seconds to execute; after adding the flags -gencode arch=compute_52,code=sm_52, that drops to around 6.7 seconds. After commenting out the code that writes to global device memory:

deviceXforces[tid] += xForce;
deviceYforces[tid] += yForce;      
deviceZforces[tid] += zForce;

... the total execution time is reduced to around 0.92 seconds, which means that writing to global device memory takes about 86% of execution time. Is there a way I can increase the performance of these writes?

Upvotes: 3

Views: 2288

Answers (1)

CygnusX1

Reputation: 21818

Memory is usually a bottleneck in this kind of computation, even if it is not actually taking 90% of the time as your measurement suggests. I would suggest two things.

Load the device...[index] into shared memory

As it stands, all threads read the same deviceXpos[index], deviceYpos[index], deviceZpos[index] and deviceMasses[index]. You could, instead, load them into shared memory:

static const int blockSize = ....;

__shared__ float shXpos[blockSize];
__shared__ float shYpos[blockSize];
__shared__ float shZpos[blockSize];
__shared__ float shMasses[blockSize];
for (int mainIndex=0; mainIndex<particlesNumber; mainIndex+=blockSize) { //assumes particlesNumber is a multiple of blockSize
    __syncthreads(); //ensure computation from previous iteration has completed
    shXpos[threadIdx.x] = deviceXpos[mainIndex + threadIdx.x];
    shYpos[threadIdx.x] = deviceYpos[mainIndex + threadIdx.x];
    shZpos[threadIdx.x] = deviceZpos[mainIndex + threadIdx.x];
    shMasses[threadIdx.x] = deviceMasses[mainIndex + threadIdx.x];
    __syncthreads(); //ensure all data is read before computation starts
    for (int index=0; index<blockSize; ++index) {
        .... //your computation, using sh....[index] values
    }
}

This should reduce the number of global memory reads, as each thread loads different data rather than all threads reading the same values.

Be aware, however, that this suggestion may not be that effective if the driver already manages L1 caching of these reads well. Try it though!
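
In case it helps, the inner loop body could look roughly like this (a sketch based on the force computation from your kernel; note that the tid check now has to compare against the global index mainIndex + index):

    for (int index = 0; index < blockSize; ++index) {
        if (tid != mainIndex + index) { //skip the particle's own contribution
            float distx = particleXpos - shXpos[index];
            float disty = particleYpos - shYpos[index];
            float distz = particleZpos - shZpos[index];
            float distance = sqrtf(distx*distx + disty*disty + distz*distz + 0.01f);
            float s = 10.0f * shMasses[index] / distance;
            xForce -= s * distx; //equivalent to += s * (otherXpos - particleXpos)
            yForce -= s * disty;
            zForce -= s * distz;
        }
    }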

Handle more than 1 (receiving) particle per thread

You may want to perform the calculation for more than one particle at a time. Instead of having just a single set of {particleX/Y/Zpos, x/y/zForce}, representing a single particle receiving the force, you could have a few of those at the same time. This way, by loading each source particle once in the loop, you can handle several receivers.

This may significantly reduce your memory pressure, but at the same time increase your register count. Too many registers, and you won't be able to launch as many threads.

Check how many registers your kernel already uses and consult the CUDA occupancy calculator to see how many more you can afford. Maybe reducing the occupancy from 1 to 0.5 or 0.75 while handling more particles per thread will still be a net win? You will need to experiment, as this may vary from GPU to GPU.
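
A rough, untuned sketch of this second idea (RECEIVERS_PER_THREAD and the index layout are just illustrative choices, not the only way to do it):

    #define RECEIVERS_PER_THREAD 2

    __global__ void calculateForcesMulti(float *deviceXpos, float *deviceYpos, float *deviceZpos,
                                         float *deviceXforces, float *deviceYforces, float *deviceZforces,
                                         float *deviceMasses, int particlesNumber) {
        //each thread accumulates forces for RECEIVERS_PER_THREAD receiving particles,
        //so every source particle loaded from global memory is reused that many times
        int base = (threadIdx.x + blockIdx.x * blockDim.x) * RECEIVERS_PER_THREAD;
        float px[RECEIVERS_PER_THREAD], py[RECEIVERS_PER_THREAD], pz[RECEIVERS_PER_THREAD];
        float fx[RECEIVERS_PER_THREAD] = {0.0f}, fy[RECEIVERS_PER_THREAD] = {0.0f}, fz[RECEIVERS_PER_THREAD] = {0.0f};
        for (int r = 0; r < RECEIVERS_PER_THREAD; ++r) {
            int p = base + r;
            if (p < particlesNumber) {
                px[r] = deviceXpos[p];
                py[r] = deviceYpos[p];
                pz[r] = deviceZpos[p];
            }
        }
        for (int index = 0; index < particlesNumber; ++index) {
            //one global load of the source particle serves all receivers of this thread
            float otherXpos = deviceXpos[index];
            float otherYpos = deviceYpos[index];
            float otherZpos = deviceZpos[index];
            float mass = deviceMasses[index];
            for (int r = 0; r < RECEIVERS_PER_THREAD; ++r) {
                int p = base + r;
                if (p < particlesNumber && p != index) {
                    float distx = px[r] - otherXpos;
                    float disty = py[r] - otherYpos;
                    float distz = pz[r] - otherZpos;
                    float distance = sqrtf(distx*distx + disty*disty + distz*distz + 0.01f);
                    float s = 10.0f * mass / distance;
                    fx[r] -= s * distx;
                    fy[r] -= s * disty;
                    fz[r] -= s * distz;
                }
            }
        }
        for (int r = 0; r < RECEIVERS_PER_THREAD; ++r) {
            int p = base + r;
            if (p < particlesNumber) {
                deviceXforces[p] += fx[r];
                deviceYforces[p] += fy[r];
                deviceZforces[p] += fz[r];
            }
        }
    }

The grid would then be sized for particlesNumber / RECEIVERS_PER_THREAD threads in total (rounded up); whether this actually wins depends on the resulting register usage, as discussed above.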

Upvotes: 3
