Michael IV

Reputation: 11424

Slow performance of CUDA kernel vs CPU version for Julia set

I am learning CUDA from the book "CUDA by Example". In Chapter 4 there is a demo that generates Julia fractals, in both CPU and GPU versions. I decided to add timing to both cases to see the execution speed, and to my great surprise found that the CPU version runs 3 times faster than the GPU version.

CPU Julia generation total time: 745 milliseconds.

GPU Julia generation total time: 2456 milliseconds.

So what is going on? It is clear, at least from the CUDA kernel code, that execution is parallel: the work is distributed across a 1000x1000 grid of blocks, each of which computes one pixel of the 1000x1000 final image.

Here is the source code of the implementation:

#define N 10
#define DIM 1000
typedef unsigned char byte;

struct cuComplex {
    float r;
    float i;
    __host__ __device__ cuComplex(float a, float b) : r(a), i(b) {}
    __host__ __device__ float magnitude2(void) {
        return r * r + i * i;
    }
    __host__ __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r * a.r - i * a.i, i * a.r + r * a.i);
    }
    __host__ __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r + a.r, i + a.i);
    }
};

__device__ int juliaGPU(int x, int y) {
    const float scale = 1.3f;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);

    cuComplex c(-0.8f, 0.156f);
    cuComplex a(jx, jy);

    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000) {
            return 0;
        }
    }
    return 1;
}

__global__ void kernelGPU(byte *ptr) {
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    int juliaValue = juliaGPU(x, y);
    ptr[offset * 4 + 0] = 255 * juliaValue;
    ptr[offset * 4 + 1] = 0;
    ptr[offset * 4 + 2] = 0;
    ptr[offset * 4 + 3] = 255;
}


struct DataBlock {
    unsigned char *dev_bitmap;
};

void juliaGPUTestSample() {
    DataBlock data;
    CPUBitmap bitmap(DIM, DIM);
    byte *dev_bitmap; // memory on GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
    data.dev_bitmap = dev_bitmap;
    dim3 grid(DIM, DIM);
    int startTime = glutGet(GLUT_ELAPSED_TIME);

    kernelGPU<<<grid, 1>>>(dev_bitmap);
    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
    int endTime = glutGet(GLUT_ELAPSED_TIME) - startTime;
    printf("Total time: %d ms\n", endTime);
    HANDLE_ERROR(cudaFree(dev_bitmap));

    bitmap.display_and_exit();
}

int main(void) {
    juliaGPUTestSample();
    return 0;
}
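Incidentally, a more precise way to time just the GPU work is with CUDA events rather than glutGet(GLUT_ELAPSED_TIME), since events are recorded on the device timeline. Here is a minimal sketch of how the timing section of juliaGPUTestSample could look (illustrative only, not the book's code):

    cudaEvent_t start, stop;
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));

    HANDLE_ERROR(cudaEventRecord(start, 0));   // enqueue start marker on the default stream
    kernelGPU<<<grid, 1>>>(dev_bitmap);
    HANDLE_ERROR(cudaEventRecord(stop, 0));    // enqueue stop marker after the kernel
    HANDLE_ERROR(cudaEventSynchronize(stop));  // block the host until the kernel has finished

    float elapsedMs = 0.0f;
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedMs, start, stop)); // elapsed time in milliseconds
    printf("Kernel time: %.3f ms\n", elapsedMs);

    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));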

Here is the CPU version:

// the cuComplex struct is the same as above.

int julia(int x, int y) {
    const float scale = 1.3f;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);

    cuComplex c(-0.8f, 0.156f);
    cuComplex a(jx, jy);

    for (int i = 0; i < 200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000) {
            return 0;
        }
    }
    return 1;
}

void kernel(unsigned char *ptr) {
    for (int y = 0; y < DIM; ++y) {
        for (int x = 0; x < DIM; ++x) {
            int offset = x + y * DIM;
            int juliaValue = julia(x, y);

            ptr[offset * 4 + 0] = juliaValue * 125;
            ptr[offset * 4 + 1] = juliaValue * x;
            ptr[offset * 4 + 2] = juliaValue * y;
            ptr[offset * 4 + 3] = 255;
        }
    }
}
void juliaCPUTestSample() {
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *ptr = bitmap.get_ptr();
    int startTime = glutGet(GLUT_ELAPSED_TIME);

    kernel(ptr);

    int endTime = glutGet(GLUT_ELAPSED_TIME) - startTime;
    printf("Total time: %d ms\n", endTime);
    bitmap.display_and_exit();
}

Update - system configuration:

Windows 7 64-bit

CPU - Intel Core i7-3770 @ 3.40GHz, 16GB RAM

GPU - NVIDIA Quadro 4000

Upvotes: 2

Views: 2385

Answers (1)

Robert Crovella

Reputation: 151799

Others have noticed this.

First of all, when talking about performance comparisons between CPU and GPU, it's a good idea to mention the system configuration, including the hardware platform and software. For example, I ran your code on an HP laptop with a Core i7 2.60GHz quad-core CPU and a Quadro 1000M GPU, running RHEL 6.2 and CUDA 5.0, and I got a score of 438 for the GPU and 441 for the CPU.

Second, and more importantly, the Julia sample in that book is a relatively early example of CUDA coding, so it's not really oriented toward maximum performance, but rather toward illustrating the concepts that have been discussed up to that point. That book, like various other CUDA tutorial material, starts by introducing parallel programming in CUDA at the block level. The indication of this is here:

kernelGPU<<<grid, 1>>>(dev_bitmap);

The kernel launch parameters <<<grid, 1>>> indicate that a grid of some number of blocks (grid, which is 1 million blocks total in this case) will be launched, with each block having a single thread. This immediately reduces the throughput of a Fermi-class GPU, for example, by a factor of 32 compared with launching a grid whose blocks have a full complement of threads. Each SM in a Fermi-class GPU has 32 thread processors, all executing in lockstep. If you launch a block with only 16 threads in it, then 16 thread processors will execute your code and the other 16 will do nothing useful. A threadblock containing only 1 thread will therefore use only 1 out of 32 thread processors, the other 31 being idle.
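For illustration, here is a minimal sketch (my own, not code from the book) of how the same work could be launched with a full complement of threads per block, say 16x16, with the kernel deriving its pixel coordinates from both blockIdx and threadIdx:

    // Hypothetical restructured kernel: 256 threads per block instead of 1.
    __global__ void kernelGPU2(byte *ptr) {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        if (x >= DIM || y >= DIM) return;  // guard against out-of-range threads in edge blocks
        int offset = x + y * DIM;          // use DIM, not gridDim.x, since blocks now span 16 pixels

        int juliaValue = juliaGPU(x, y);
        ptr[offset * 4 + 0] = 255 * juliaValue;
        ptr[offset * 4 + 1] = 0;
        ptr[offset * 4 + 2] = 0;
        ptr[offset * 4 + 3] = 255;
    }

    // Launch: ceil-divide the 1000x1000 image into 16x16 tiles.
    dim3 threads(16, 16);
    dim3 blocks((DIM + 15) / 16, (DIM + 15) / 16);
    kernelGPU2<<<blocks, threads>>>(dev_bitmap);

With 256 threads per block, each warp of 32 thread processors is kept fully occupied instead of running one thread and idling the other 31.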

Therefore this particular code sample is not well designed to utilize the full parallel capability of the GPU. Given that it appears relatively early in the book's exposition of CUDA concepts, this is understandable; I don't believe it was the authors' intent to have this code benchmarked or used as a legitimate representation of how to write fast code on the GPU.

In light of this factor of 32, the fact that on your system the CPU is only 3 times faster, and that on my system the CPU and GPU have comparable throughput (neither of these being particularly high-performance CUDA GPUs, most likely), actually shows the GPU in reasonably good light: it is fighting this battle with about 97% of its capability unused (31/32 ≈ 97%).

Upvotes: 10
