Matthew Ha

Reputation: 143

CUDA performance depending on how a variable is declared

Are there any tips for improving CUDA performance in a case like this, such as how variables are declared (global vs. local), how parameters are passed, or how memory is copied?

I'm trying to figure out why the performance of sum_gpu_FAST and sum_gpu_SLOW differs so much in the example below.

Here is the complete example code:

#include <iostream>
#include <cstdio>
#include <chrono>
#define N 10000000
__global__
void sum_gpu_FAST(int (&data)[N][2], int& sum, int n) {  // runtime : 2.42342s
    int s = 0;
    for (int i = 0; i < n; i++) 
        s += data[i][0] * 10 + data[i][1];
    sum = s;
}
__global__
void sum_gpu_SLOW(int (&data)[N][2], int& sum, int n) {  // runtime : 436.64ms
    sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i][0] * 10 + data[i][1];
    }
}
void sum_cpu(int (*data)[2], int& sum, int n) {
    for (int i = 0; i < n; i++) {
        sum +=  data[i][0] * 10 + data[i][1];
    }
}
int main()
{
    int (*v)[2] = new int[N][2];
    for (int i = 0; i < N; i++)
        v[i][0] = 1, v[i][1] = 3;
    printf ("-CPU------------------------------------------------\n");
    {
        int sum = 0;
        auto start = std::chrono::system_clock::now();
        sum_cpu(v, sum, N);
        auto end   = std::chrono::system_clock::now();
        // print output
        std::cout << sum << " / " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
    }
    printf ("-GPU-Ready------------------------------------------\n");
    int *dev_sum       = nullptr;
    int (*dev_v)[N][2] = nullptr;
    cudaMalloc((void **)&dev_v,   sizeof(int[N][2]));
    cudaMalloc((void **)&dev_sum, sizeof(int));
    cudaMemcpy(dev_v, v, sizeof(int[N][2]), cudaMemcpyHostToDevice);
    printf("-GPU-FAST-------------------------------------------\n");
    {
        int sum = 0;
        auto start = std::chrono::system_clock::now();
        sum_gpu_FAST<<<1, 1>>> (*dev_v, *dev_sum, N);
        cudaDeviceSynchronize(); // wait until end of kernel
        auto end   = std::chrono::system_clock::now();
        // print output
        cudaMemcpy( &sum, dev_sum, sizeof(int), cudaMemcpyDeviceToHost );
        std::cout << sum << " / " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
    }
    printf("-GPU-SLOW-------------------------------------------\n");
    {
        int sum = 0;
        auto start = std::chrono::system_clock::now();
        sum_gpu_SLOW<<<1, 1>>> (*dev_v, *dev_sum, N);
        cudaDeviceSynchronize(); // wait until end of kernel
        auto end   = std::chrono::system_clock::now();
        // print output
        cudaMemcpy( &sum, dev_sum, sizeof(int), cudaMemcpyDeviceToHost );
        std::cout << sum << " / " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
    }
    printf("----------------------------------------------------\n");
    return 0;
}



Upvotes: 0

Views: 546

Answers (1)

Robert Crovella

Reputation: 151879

I'm trying to figure out why the performance of sum_gpu_FAST and sum_gpu_SLOW differs so much in the example below.

In the fast case, you are creating a local variable which is contained (presumably) in a register:

int s = 0;

During the loop iterations, reads are occurring from global memory, but the only write operation is to a register:

for (int i = 0; i < n; i++) 
    s += data[i][0] * 10 + data[i][1];

In the slow case, the running sum is contained in a variable resident in global memory:

sum = 0;

therefore, at each loop iteration, the updated value is written to global memory:

for (int i = 0; i < n; i++) {
    sum += data[i][0] * 10 + data[i][1];
}
Therefore the loop incurs the additional overhead of a write to global memory at every iteration (on the order of 10 million extra stores for N = 10,000,000), which is slower than keeping the running sum in a register.
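
To make the pattern concrete, here is a minimal annotated sketch of the register-accumulator approach; it is essentially sum_gpu_FAST from the question, with comments marking where each memory access lands (the kernel name is illustrative only):

#define N 10000000

__global__
void sum_gpu_register(int (&data)[N][2], int& sum, int n) {
    int s = 0;                               // running total lives in a register
    for (int i = 0; i < n; i++)
        s += data[i][0] * 10 + data[i][1];   // two global loads per iteration, no global store
    sum = s;                                 // the only global store, issued once after the loop
}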

I'm not going to completely dissect the SASS code to compare these two cases, because the compiler is making other decisions in the fast case around loop unrolling and possibly other factors, but my guess is that not having to store results to global memory during the loop iterations considerably assists with loop unrolling as well. However, we can make a simple deduction based on the tail end of the SASS code for each case.
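
To reproduce the listings below, one possible workflow (assuming the source file is named t.cu and targeting sm_70 to match the headers in the dumps; both tools ship with the CUDA toolkit) is:

nvcc -arch=sm_70 -o t t.cu   # build, embedding sm_70 machine code
cuobjdump -sass ./t          # disassemble the embedded cubin to SASS

Here is the tail end of the fast kernel: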

        code for sm_70
                Function : _Z12sum_gpu_FASTRA10000000_A2_iRii
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                        /* 0x00000a0000017a02 */
                                                                                 /* 0x000fd00000000f00 */
...
        /*0b00*/                   STG.E.SYS [R2], R20 ;                         /* 0x0000001402007386 */
                                                                                 /* 0x000fe2000010e900 */
        /*0b10*/                   EXIT ;                                        /* 0x000000000000794d */
                                                                                 /* 0x000fea0003800000 */

In the fast case above, we see that there is a single global store (STG) instruction at the end of the kernel, right before the EXIT instruction and outside of any loops in the kernel. Although I haven't shown the full listing, there are indeed no other STG instructions in the fast kernel except the one at the end. We see a different story when we look at the tail end of the slow kernel:

        code for sm_70
                Function : _Z12sum_gpu_SLOWRA10000000_A2_iRii
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;       /* 0x00000a00ff017624 */
                                                                                 /* 0x000fd000078e00ff */
...
        /*0460*/                   STG.E.SYS [R2], R7 ;                          /* 0x0000000702007386 */
                                                                                 /* 0x0005e2000010e900 */
        /*0470*/              @!P0 BRA 0x2f0 ;                                   /* 0xfffffe7000008947 */
                                                                                 /* 0x000fea000383ffff */
        /*0480*/                   EXIT ;                                        /* 0x000000000000794d */
                                                                                 /* 0x000fea0003800000 */

In the slow kernel, the listing ends with a loop (the conditional branch back to 0x2f0) that has the STG instruction inside its body. The slow kernel also has many other instances of the STG instruction throughout, presumably because of compiler unrolling.
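
A quick way to confirm the store counts without reading the full listings is to filter the disassembly for the function headers and the STG lines (using the build and cuobjdump invocation sketched above):

cuobjdump -sass ./t | grep -E "Function :|STG"   # each kernel's name, followed by its global-store instructions

With this view, the fast kernel shows a single STG while the slow kernel shows many.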

Upvotes: 2
