Reputation: 123
I wrote a code in OpenCL to find the first 5000 prime numbers. Here's that code:
__kernel void dataParallel(__global int* A)
{
A[0]=2;
A[1]=3;
A[2]=5;
int pnp;//pnp=probable next prime
int pprime;//previous prime
int i,j;
for(i=3;i<5000;i++)
{
j=0;
pprime=A[i-1];
pnp=pprime+2;
while((j<i) && A[j]<=sqrt((float)pnp))
{
if(pnp%A[j]==0)
{
pnp+=2;
j=0;
}
j++;
}
A[i]=pnp;
}
}
Then I found out the execution time of this kernel code using OpenCL profiling. Here's the code:
cl_event event;//link an event when launch a kernel
ret=clEnqueueTask(cmdqueue,kernel,0, NULL, &event);
clWaitForEvents(1, &event);//make sure kernel has finished
clFinish(cmdqueue);//make sure all enqueued tasks finished
//get the profiling data and calculate the kernel execution time
cl_ulong time_start, time_end;
double total_time;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
//total_time = (cl_double)(time_end - time_start)*(cl_double)(1e-06);
printf("OpenCl Execution time is: %10.5f[ms] \n",(time_end - time_start)/1000000.0);
I ran these codes on various devices and this is what I got:
Platform:Intel(R) OpenCL
Device:Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
OpenCl Execution time is: 3.54796[ms]
Platform:AMD Accelerated Parallel Processing
Device:Pitcairn (AMD FirePro W7000 GPU)
OpenCl Execution time is: 194.18133[ms]
Platform:AMD Accelerated Parallel Processing
Device:Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
OpenCl Execution time is: 3.58488[ms]
Platform:NVIDIA CUDA
Device:Tesla C2075
OpenCl Execution time is: 125.26886[ms]
But aren't GPUs supposed to be faster than CPUs? Or, is there anything wrong with my code/implementation? Please explain this behaviour.
Upvotes: 1
Views: 2542
Reputation: 326
The provided code is a sequential algorithm that relies on previous values. If you are running it with a global_work_size > 1, you are just performing same calculation over and over. The opencl implementation should compute primes less than N sequentially, then run a test in parallel for numbers [N+1; N*N] if they are divisible by any of those primes and fill in the sieve array with 0 if the number is not prime and 1 if the number is prime. E.g. not my code, someone's homework, and i did not check if it really works
If you need more than N^2 elements, calculate a prefix sum of the sieve array (exclusive scan).AMD APP SDK contains a sample of this operation. This will give you offsets of the prime numbers for copying into prime number array and you'll be able to populate it:
__kernel scatter(uint* numbers, uint* sieve_prefix_sum, uint* sieve, uint offset, uint* prime_numbers)
{
if (sieve[get_global_id(0)])
prime_numbers[offset + sieve_prefix_sum[get_global_id(0)] = numbers[get_global_id(0)];
}
This algorithm works like a tree - you compute primes up to N sequentially and then evaluate K blocks in [N+1, N*N] range, then you repeat and grow next set of branches for [N^2, N^4] etc.
Upvotes: 0
Reputation: 8410
clEnqueueTask()
So basically, you are running 1 single "thread"(work items) in the GPU. A GPU will never beat a CPU in single thread performance.
You need to convert your code, such that you divide each prime calculation to a thread and then you run 5000+ work items(ideally millions). Then, the GPU will beat the CPU simply because it will run all that in parallel and CPU can't.
In order to use multiple work items, call your kernel with clEnqueueNDRangeKernel()
Upvotes: 7