Reputation: 2488
I am trying to compare performance in CPU and GPU. I have
I can confirm that GPU is configured and works correctly with CUDA.
I am implementing Julia set computation (http://en.wikipedia.org/wiki/Julia_set). Basically, for every pixel, if the coordinate is in the set the pixel is painted red; otherwise it is painted white.
I get identical results with both the CPU and the GPU, but instead of a performance improvement I get a performance penalty from using the GPU.
Running times
I am aware that transferring data from the device to the host can take some time. But still, how do I know whether using the GPU is actually beneficial?
Here is the relevant GPU code
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Returns true if pixel (x, y) is still bounded after the iteration limit.
__device__ bool isJulia( float x, float y, float maxX_2, float maxY_2 )
{
    float z_r = 0.8 * (float) (maxX_2 - x) / maxX_2;
    float z_i = 0.8 * (float) (maxY_2 - y) / maxY_2;
    float c_r = -0.8;
    float c_i = 0.156;
    for( int i=1 ; i<100 ; i++ )
    {
        float tmp_r = z_r*z_r - z_i*z_i + c_r;
        float tmp_i = 2*z_r*z_i + c_i;
        z_r = tmp_r;
        z_i = tmp_i;
        if( sqrt( z_r*z_r + z_i*z_i ) > 1000 )
            return false;
    }
    return true;
}

// One thread per pixel; each pixel occupies three bytes (RGB).
__global__ void kernel( unsigned char * im, int dimx, int dimy )
{
    //int tid = blockIdx.y*gridDim.x + blockIdx.x;
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    tid *= 3;
    if( isJulia((float)blockIdx.x, (float)threadIdx.x, (float)dimx/2, (float)dimy/2)==true )
    {
        im[tid] = 255;
        im[tid+1] = 0;
        im[tid+2] = 0;
    }
    else
    {
        im[tid] = 255;
        im[tid+1] = 255;
        im[tid+2] = 255;
    }
}

int main()
{
    int dimx=768, dimy=768;
    //on cpu
    unsigned char * im = (unsigned char*) malloc( 3*dimx*dimy );
    //on GPU
    unsigned char * im_dev;
    //allocate mem on GPU
    cudaMalloc( (void**)&im_dev, 3*dimx*dimy );
    //launch kernel: dimx blocks of dimy threads
    for( int z=0 ; z<10000 ; z++ ) // loop for multiple times computation
    {
        kernel<<<dimx,dimy>>>(im_dev, dimx, dimy);
    }
    //copy the result back to the host (this also waits for the kernels to finish)
    cudaMemcpy( im, im_dev, 3*dimx*dimy, cudaMemcpyDeviceToHost );
    writePPMImage( im, dimx, dimy, 3, "out_gpu.ppm" ); //assume this writes a ppm file
    free( im );
    cudaFree( im_dev );
}
Here is the CPU code
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

// Returns true if pixel (x, y) is still bounded after the iteration limit.
bool isJulia( float x, float y, float maxX_2, float maxY_2 )
{
    float z_r = 0.8 * (float) (maxX_2 - x) / maxX_2;
    float z_i = 0.8 * (float) (maxY_2 - y) / maxY_2;
    float c_r = -0.8;
    float c_i = 0.156;
    for( int i=1 ; i<100 ; i++ )
    {
        float tmp_r = z_r*z_r - z_i*z_i + c_r;
        float tmp_i = 2*z_r*z_i + c_i;
        z_r = tmp_r;
        z_i = tmp_i;
        if( sqrt( z_r*z_r + z_i*z_i ) > 1000 )
            return false;
    }
    return true;
}

int main(void)
{
    const int dimx = 768, dimy = 768;
    int i, j;
    unsigned char * data = new unsigned char[dimx*dimy*3];
    for( int z=0 ; z<10000 ; z++ ) // loop for multiple times computation
    {
        for (j = 0; j < dimy; ++j)
        {
            for (i = 0; i < dimx; ++i)
            {
                if( isJulia(i,j,dimx/2,dimy/2) == true )
                {
                    data[3*j*dimx + 3*i + 0] = (unsigned char)255; /* red */
                    data[3*j*dimx + 3*i + 1] = (unsigned char)0; /* green */
                    data[3*j*dimx + 3*i + 2] = (unsigned char)0; /* blue */
                }
                else
                {
                    data[3*j*dimx + 3*i + 0] = (unsigned char)255; /* red */
                    data[3*j*dimx + 3*i + 1] = (unsigned char)255; /* green */
                    data[3*j*dimx + 3*i + 2] = (unsigned char)255; /* blue */
                }
            }
        }
    }
    writePPMImage( data, dimx, dimy, 3, "out_cpu.ppm" ); //assume this writes a ppm file
    delete [] data;
    return 0;
}
Further, following suggestions from @hyde, I have looped the computation-only part to generate 10,000 images. I am not writing all of those images to disk; only the computation is repeated.
Here are the running times
Upvotes: 3
Views: 1742
Reputation: 62777
Turning my comments into an answer:
To get relevant figures, you need to calculate more than one image, so that the execution time is at least seconds or tens of seconds. Also, including the file saving time in the results adds noise and hides the actual CPU vs GPU difference.
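As a rough illustration of timing only the computation, here is a minimal C++ sketch. The helper name timeRepeated is made up for this example; it simply wraps whatever work you pass in with a steady clock and leaves file output outside the timed region. For the GPU version, keep in mind that kernel launches are asynchronous, so the timed work would have to end with something that waits for the device (for example cudaDeviceSynchronize() or the cudaMemcpy already in the question).

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the question): runs `work` `repeats` times
// and returns the total wall-clock time in seconds. Only the calls to
// `work` are timed, so file output and setup stay outside the measurement.
template <typename Work>
double timeRepeated(Work work, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

You would call it with a lambda that computes one image, and pick repeats large enough that the returned time is several seconds.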
Another way to get real results is to select a Julia set which has a lot of points belonging to the set, and then raise the iteration count so high that it takes many seconds to calculate just one image. Then there is only a single calculation setup, so this is likely to be the most advantageous scenario for GPU/CUDA.
To measure how much overhead there is, change the image size to 1x1 and the iteration limit to 1, and then calculate enough images that it takes at least a few seconds. In this scenario, the GPU is likely to be significantly slower.
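To make experiments like these easy to run, the image size and iteration limit have to be parameters rather than constants. Below is a CPU-side sketch under a few assumptions: isJuliaN and countJuliaPixels are names invented here, maxIter counts iterations directly (the loop in the question runs 99 of them), and the half-sizes use float division so that a 1x1 image does not divide by zero, which the integer dimx/2 in the CPU code would do. It counts in-set pixels instead of writing an image, purely to keep the example short.

```cpp
#include <cassert>
#include <cmath>

// Same recurrence as isJulia in the question, but the iteration limit is a
// parameter (maxIter is an addition for this sketch).
bool isJuliaN(float x, float y, float maxX_2, float maxY_2, int maxIter)
{
    float z_r = 0.8f * (maxX_2 - x) / maxX_2;
    float z_i = 0.8f * (maxY_2 - y) / maxY_2;
    const float c_r = -0.8f;
    const float c_i = 0.156f;
    for (int i = 0; i < maxIter; ++i)
    {
        const float tmp_r = z_r * z_r - z_i * z_i + c_r;
        const float tmp_i = 2.0f * z_r * z_i + c_i;
        z_r = tmp_r;
        z_i = tmp_i;
        if (std::sqrt(z_r * z_r + z_i * z_i) > 1000.0f)
            return false;   // escaped: not in the set
    }
    return true;            // still bounded after maxIter iterations
}

// Hypothetical helper: returns how many pixels of a dimx-by-dimy image are
// classified as in the set for the given iteration limit.
long countJuliaPixels(int dimx, int dimy, int maxIter)
{
    assert(dimx > 0 && dimy > 0 && maxIter >= 0);
    const float halfX = dimx / 2.0f;   // float division, safe for dimx == 1
    const float halfY = dimy / 2.0f;
    long inSet = 0;
    for (int j = 0; j < dimy; ++j)
        for (int i = 0; i < dimx; ++i)
            if (isJuliaN((float)i, (float)j, halfX, halfY, maxIter))
                ++inSet;
    return inSet;
}
```

Calling countJuliaPixels(1, 1, 1) many times approximates the pure-overhead case on the CPU side, while a large maxIter on the real image size approximates the compute-heavy case.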
To get the most relevant timings for your use case, select the image size and iteration count you are really going to use, and then find the image count at which both versions are equally fast. That will give you a rough rule of thumb for deciding which to use when.
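If you would rather estimate that break-even image count than search for it by trial, one possible approach is to assume the total time grows linearly with the number of images (a fixed overhead plus a constant cost per image), fit that model from two timed runs per version, and solve for the crossover. The linear model and the names CostModel, fitLinear and breakEvenImages are assumptions of this sketch, not something measured here.

```cpp
#include <cassert>
#include <cmath>

// Assumed linear cost model: time(n) = overheadSec + n * perImageSec.
struct CostModel
{
    double overheadSec;
    double perImageSec;
};

// Fits the model from two timed runs: n1 images took t1 seconds and
// n2 images took t2 seconds.
CostModel fitLinear(int n1, double t1, int n2, double t2)
{
    assert(n1 != n2);
    const double perImage = (t2 - t1) / (n2 - n1);
    return { t1 - perImage * n1, perImage };
}

// Returns the smallest image count from which the GPU is at least as fast
// as the CPU under the model, or -1 if the GPU is not cheaper per image
// (in that case the two lines never cross in the GPU's favour).
long breakEvenImages(const CostModel& cpu, const CostModel& gpu)
{
    if (gpu.perImageSec >= cpu.perImageSec)
        return -1;
    const double n = (gpu.overheadSec - cpu.overheadSec)
                   / (cpu.perImageSec - gpu.perImageSec);
    return n <= 0.0 ? 0 : static_cast<long>(std::ceil(n));
}
```

Treat the result as a starting point only, and confirm it by actually timing both versions near the predicted image count.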
An alternative approach for practical results, if you are only going to compute one image: find the iteration limit for a single worst-case image at which the CPU and GPU are equally fast. If you need that many iterations or more, choose the GPU; otherwise choose the CPU.
Upvotes: 3