physicist

Reputation: 111

Inefficient kernel function

Is there any way to accelerate this simple kernel function? I have thought about using shared memory, but N is 507904, which is far larger than a shared memory array can hold.

My program creates blocks of 256 threads each.

__global__ void compute(COMPLEX_TYPE *a, COMPLEX_TYPE *b,
              FLOAT_TYPE *F, FLOAT_TYPE f, int N) 
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) 
    {
        F[i] = ( a[i].x*a[i].x + a[i].y*a[i].y + b[i].x*b[i].x + b[i].y*b[i].y) / (f);
    }
}

Upvotes: 0

Views: 165

Answers (1)

talonmies

Reputation: 72349

The simplest general optimisation would be something like this:

__global__ void compute(const COMPLEX_TYPE * __restrict__ a, 
                        const COMPLEX_TYPE * __restrict__ b,
                        FLOAT_TYPE *F, FLOAT_TYPE f, int N) 
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    #pragma unroll 8
    for(; i < N; i += blockDim.x * gridDim.x)
    {
        COMPLEX_TYPE aval = a[i], bval = b[i];
        FLOAT_TYPE Fval;
        Fval = ( aval.x*aval.x + aval.y*aval.y + bval.x*bval.x + bval.y*bval.y) / (f);
        F[i] = Fval;
    }
}

[disclaimer: written in browser, not tested, use at own risk]

The idea here is to launch only as many threads as will execute concurrently on your target GPU, and then have every thread perform multiple operations rather than just one. This helps amortise a lot of the fixed overhead at the block scheduler and setup code level and improves overall efficiency. On most architectures this will probably be memory-bandwidth limited anyway, so memory coalescing and transaction optimisation are about the most important performance optimisations you will be able to make.
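One (untested, written-in-browser) way to choose such a launch configuration is to ask the runtime occupancy API how many blocks can be resident per multiprocessor and multiply by the number of multiprocessors; the names below are just the kernel arguments from the question:

// Untested sketch: size the grid so there are only enough blocks to fill the
// device once, rather than one thread per element. Assumes the grid-striding
// version of compute above; a, b, F, f, N are the kernel arguments.
int nTPB = 256;                       // threads per block, as in the question
int device = 0, numSM = 0, blocksPerSM = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, compute, nTPB, 0);
int nBlocks = blocksPerSM * numSM;    // enough blocks to occupy the whole GPU
compute<<<nBlocks, nTPB>>>(a, b, F, f, N);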

EDIT: Since this answer was marked CW, I elected to add my tests here, rather than create my own answer. If anyone objects to this, please just roll back the edit to a previous acceptable version. I'm not adding any new ideas, just testing those provided by @talonmies and @JanLucas

In my test case, the suggestions offered by @talonmies (excepting the unroll pragma) seem to give about a 10% performance improvement. The suggestion by @JanLucas, to replace the floating-point divide with a floating-point multiply (if acceptable), seems to give about a doubling of performance. This will obviously vary depending on GPU and other specifics. Here's my test:

$ cat t891.cu
#include <cuComplex.h>
#include <stdio.h>
#include <stdlib.h>

#define DSIZE 507904
#define nTPB 256
#define nBLK 256

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

typedef cuFloatComplex COMPLEX_TYPE;
typedef float FLOAT_TYPE;

__global__ void compute(COMPLEX_TYPE *a, COMPLEX_TYPE *b,
              FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        F[i] = ( a[i].x*a[i].x + a[i].y*a[i].y + b[i].x*b[i].x + b[i].y*b[i].y) / (f);
    }
}

__global__ void compute_imp(const COMPLEX_TYPE * __restrict__ a,
                        const COMPLEX_TYPE * __restrict__ b,
                        FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
//    #pragma unroll 8
    for(; i < N; i += blockDim.x * gridDim.x)
    {
        COMPLEX_TYPE aval = a[i];
        COMPLEX_TYPE bval = b[i];
        FLOAT_TYPE Fval = ( aval.x*aval.x + aval.y*aval.y + bval.x*bval.x + bval.y*bval.y) / (f);
        F[i] = Fval;
    }
}

__global__ void compute_imp2(const COMPLEX_TYPE * __restrict__ a,
                        const COMPLEX_TYPE * __restrict__ b,
                        FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
//    #pragma unroll 8
    for(; i < N; i += blockDim.x * gridDim.x)
    {
        COMPLEX_TYPE aval = a[i];
        COMPLEX_TYPE bval = b[i];
        FLOAT_TYPE Fval = ( aval.x*aval.x + aval.y*aval.y + bval.x*bval.x + bval.y*bval.y) * (f);
        F[i] = Fval;
    }
}

int main(){

  COMPLEX_TYPE *d_A, *d_B;
  FLOAT_TYPE *d_F, f = 4.0f;

  cudaMalloc(&d_A, DSIZE*sizeof(COMPLEX_TYPE));
  cudaMalloc(&d_B, DSIZE*sizeof(COMPLEX_TYPE));
  cudaMalloc(&d_F, DSIZE*sizeof(FLOAT_TYPE));

  //warm-up
  compute<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_F, f, DSIZE);
  cudaDeviceSynchronize();
  unsigned long long t1 = dtime_usec(0);
  compute<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_F, f, DSIZE);
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);

  //warm-up
  compute_imp<<<DSIZE/(8*nTPB),nTPB>>>(d_A, d_B, d_F, f, DSIZE);
  cudaDeviceSynchronize();
  unsigned long long t2 = dtime_usec(0);
  compute_imp<<<nBLK,nTPB>>>(d_A, d_B, d_F, f, DSIZE);
  cudaDeviceSynchronize();
  t2 = dtime_usec(t2);

  //warm-up
  compute_imp2<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_F, 1/f, DSIZE);
  cudaDeviceSynchronize();
  unsigned long long t3 = dtime_usec(0);
  compute_imp2<<<nBLK,nTPB>>>(d_A, d_B, d_F, 1/f, DSIZE);
  cudaDeviceSynchronize();
  t3 = dtime_usec(t3);
  cudaCheckErrors("some error");
  printf("t1: %fs, t2: %fs, t3: %fs\n", t1/(float)USECPSEC, t2/(float)(USECPSEC), t3/(float)USECPSEC);
}
$ nvcc -O3 -o t891 t891.cu
$ ./t891
t1: 0.000226s, t2: 0.000209s, t3: 0.000110s
$
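
As a rough sanity check on the bandwidth-bound claim above: each element moves about 20 bytes (two 8-byte complex reads plus one 4-byte write), so 507904 elements is roughly 10.2 MB. At t3 ≈ 110 µs that works out to roughly 92 GB/s, a large fraction of the Quadro5000's ~120 GB/s peak memory bandwidth.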

Notes:

  1. The unroll pragma doesn't seem to help (it made the code run slower in the few test cases I tried). The compiler will already unroll loops in some cases without a specific hint, and loop unrolling is generally an optimization that requires tuning, perhaps careful tuning.
  2. The modification proposed by @talonmies, turning the kernel into a grid-striding loop, is one of the factors that would need to be taken into account to make a specific loop-unroll trip count useful. The overall grid dimension should be reduced by at least a factor equal to the unroll trip count; see the sketch after this list. However, I wasn't able to find a "sweet spot".
  3. I mostly tested on a Quadro5000 (Fermi cc2.0 GPU), CUDA 7.5RC, Fedora 20. Certainly the behavior will be different on different GPUs, especially newer ones.
  4. The nBLK parameter in this code is another "tunable" parameter, but I saw little variation once it was above about 64 or so. The best case might be to have a grid equal in size to the data.
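
As a concrete (untested) illustration of note 2, pairing an unroll trip count of 8 with a correspondingly smaller grid, and reusing the definitions from the test program above:

// Sketch for note 2: with "#pragma unroll 8" re-enabled in compute_imp,
// shrink the grid so each thread strides through roughly 8 elements.
const int unroll = 8;
int nBlocks = (DSIZE + unroll*nTPB - 1) / (unroll*nTPB);  // 248 blocks for DSIZE=507904, nTPB=256
compute_imp<<<nBlocks, nTPB>>>(d_A, d_B, d_F, f, DSIZE);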

Upvotes: 4
