user727062

Reputation: 123

CUDA performance test

I'm writing a simple CUDA program for a performance test.
It is not related to vector calculation, just a simple (parallel) string conversion.

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>


#define UCHAR           unsigned char
#define UINT32          unsigned long int

#define DOCU_SIZE       4096
#define TOTAL           100000
#define BLOCK_SIZE      500


UCHAR           pH_TXT[DOCU_SIZE * TOTAL];
UCHAR           pH_ENC[DOCU_SIZE * TOTAL];
UCHAR*          pD_TXT;
UCHAR*          pD_ENC;


__global__
void    TEST_Encode( UCHAR *a_input, UCHAR *a_output )
{
    UCHAR       *input;
    UCHAR       *output;

    // global document index: blockIdx.x must be included, otherwise every
    // block processes the same first blockDim.x documents
    int         idx;

    idx     = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx >= TOTAL ) return;

    input   = &(a_input[idx * DOCU_SIZE]);
    output  = &(a_output[idx * DOCU_SIZE]);

    for ( int i = 0 ; i < 30 ; i++ ) {
        if ( (input[i] >= 'a') && (input[i] <= 'z') ) {
            output[i] = input[i] - 'a' + 'A';
        }
        else {
            output[i] = input[i];
        }
    }
}


int main(int argc, char** argv)
{
    struct  cudaDeviceProp  xCUDEV;

    cudaGetDeviceProperties(&xCUDEV, 0);


    // Prepare Source
    memset(pH_TXT, 0x00, DOCU_SIZE * TOTAL);

    for ( int i = 0 ; i < TOTAL ; i++ ) {
        strcpy((char*)pH_TXT + (i * DOCU_SIZE), "hello world, i need an apple.");
    }

    // Allocate vectors in device memory
    cudaMalloc((void**)&pD_TXT, DOCU_SIZE * TOTAL);
    cudaMalloc((void**)&pD_ENC, DOCU_SIZE * TOTAL);

    // Copy vectors from host memory to device memory
    cudaMemcpy(pD_TXT, pH_TXT, DOCU_SIZE * TOTAL, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = BLOCK_SIZE;
    int blocksPerGrid = (TOTAL + threadsPerBlock - 1) / threadsPerBlock;

    printf("Total Task is %d\n", TOTAL);
    printf("block size is %d\n", threadsPerBlock);
    printf("repeat cnt is %d\n", blocksPerGrid);

    TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);

    cudaMemcpy(pH_ENC, pD_ENC, DOCU_SIZE * TOTAL, cudaMemcpyDeviceToHost);

    // Free device memory
    if (pD_TXT)         cudaFree(pD_TXT);
    if (pD_ENC)         cudaFree(pD_ENC);

    cudaDeviceReset();

    return 0;
}
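
None of the CUDA calls above are checked, so a failed launch can masquerade as a very fast run in the profiler. A minimal sketch of an error-check wrapper (the macro name CUDA_CHECK is hypothetical, not from the original program):

#include <stdlib.h>     // for exit()

// Hypothetical helper: abort on the first CUDA error so a failed kernel
// launch is not mistaken for a fast one.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage around the launch:
//     TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);
//     CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
//     CUDA_CHECK(cudaDeviceSynchronize());   // errors during execution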

And when I change the BLOCK_SIZE value from 2 to 1000, I get the following duration times (from NVIDIA Visual Profiler):

TOTAL       BLOCKS      BLOCK_SIZE  Duration(ms)
100000      50000       2           28.22
100000      10000       10          22.223
100000      2000        50          12.3
100000      1000        100         9.624
100000      500         200         10.755
100000      250         400         29.824
100000      200         500         39.67
100000      100         1000        81.268

My GPU is a GeForce GT 520 and the maximum threadsPerBlock value is 1024, so I predicted that I would get the best performance when BLOCK_SIZE is 1000, but the table above shows a different result.

I can't understand why the duration time is not linear, and how I can fix this problem (or how I can find the optimal block size that minimizes the duration time).

Upvotes: 1

Views: 1260

Answers (1)

djmj

Reputation: 5544

It seems that 2, 10, or 50 threads per block don't utilize the capabilities of the GPU, since it is designed to run many more threads than that.

Your card has compute capability 2.1.

  • Maximum number of resident threads per multiprocessor = 1536
  • Maximum number of threads per block = 1024
  • Maximum number of resident blocks per multiprocessor = 8
  • Warp size = 32
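
These limits can also be read at runtime. The question's code already calls cudaGetDeviceProperties but discards the result; a minimal sketch that prints the relevant fields (note: the resident-blocks-per-SM limit is not exposed as a cudaDeviceProp field in older toolkits):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("device:                      %s\n", prop.name);
    printf("warp size:                   %d\n", prop.warpSize);
    printf("max threads per block:       %d\n", prop.maxThreadsPerBlock);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("number of SMs:               %d\n", prop.multiProcessorCount);
    return 0;
}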

There are two issues:

1.

You try to occupy so much register memory per thread that it will definitely be spilled out to the slow local memory space as your block size increases.

2.

Perform your tests with multiples of 32, since this is the warp size of your card and many memory operations are optimized for thread counts that are a multiple of the warp size.

So if you use around 1024 threads per block (1000 in your case), only one block can be resident per SM, which leaves at most 1024 of the 1536 resident thread slots in use; roughly 33% of your GPU is idle.
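
Newer CUDA toolkits (6.5 and later, so more recent than this answer) can also suggest an occupancy-optimal block size directly. A sketch, assuming it is placed in the question's main() where TEST_Encode is visible:

int minGridSize = 0, bestBlockSize = 0;

// Ask the runtime for the block size that maximizes occupancy for this
// kernel (0 bytes of dynamic shared memory, no upper block-size limit).
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlockSize,
                                   TEST_Encode, 0, 0);
printf("occupancy-optimal block size: %d\n", bestBlockSize);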

What happens if you use the following 100% occupancy sizes? (An empirical timing sweep is sketched after this list.)

  • 128 = 12 blocks -> since only 8 blocks can be resident per SM, execution of the extra blocks is serialized (and only 8 × 128 = 1024 threads are actually resident)
  • 192 = 8 resident blocks per SM
  • 256 = 6 resident blocks per SM
  • 512 = 3 resident blocks per SM
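
The sweep itself is easy to automate. A minimal sketch (not from the original answer) that could replace the single launch in the question's main(), timing each candidate block size with CUDA events:

int candidates[] = { 32, 64, 96, 128, 160, 192, 224, 256, 384, 512, 1024 };
int numCand     = sizeof(candidates) / sizeof(candidates[0]);

TEST_Encode<<<1, 32>>>(pD_TXT, pD_ENC);   // warm-up launch so the first
cudaDeviceSynchronize();                  // measurement is not skewed

for (int c = 0; c < numCand; c++) {
    int tpb = candidates[c];
    int bpg = (TOTAL + tpb - 1) / tpb;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    TEST_Encode<<<bpg, tpb>>>(pD_TXT, pD_ENC);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block size %4d -> %8.3f ms\n", tpb, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}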

Upvotes: 3
