whitepearl

Reputation: 644

Why does program execution take almost the same user time on the CPU as on the GPU?

The program for finding prime numbers using OpenCL 1.1 gave the following timings:

Device: CPU

Real time: approx. 3 sec, user time: approx. 32 sec

Device: GPU

Real time: approx. 37 sec, user time: approx. 32 sec

Why is the user time of the GPU run not lower than that of the CPU run? Is data/task parallelization not occurring?

System specifications: 64-bit CentOS 5.3 system with two ATI Radeon 5970 graphics cards and an Intel Core i7 processor (12 cores)

Upvotes: 2

Views: 820

Answers (1)

mfa

Reputation: 5087

Your kernel is rather inefficient; I have included an adjusted one below for you to consider. As to why it runs better on a CPU device:

  1. With your algorithm, the work items take varying amounts of time to execute, and they take longer as the numbers tested grow larger. A work group on a GPU does not finish until all of its items are finished, so some of the hardware sits idle until the last item is done. A CPU device behaves more like a loop iterating over the work items, so the difference in cycles needed to compute each item does not drastically affect performance.
  2. 'A' is not used by the kernel. It should not be copied to the device unless it is used. It looks like you wanted to test A[i] rather than 'i' itself, though.
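
To put a rough number on point 1, here is a minimal host-side C sketch (not OpenCL, and not a measurement of your hardware). It counts the trial-division iterations each candidate needs and then estimates how much of a group's time is wasted if every item has to wait for the slowest one. The function names `trial_division_steps` and `idle_fraction` and the `group_size` parameter are made up for illustration, and the lockstep model is a simplifying assumption about how a work group behaves.

```c
#include <assert.h>

/* Loop iterations that trial division by odd numbers 3, 5, ... spends on n,
   stopping at the first divisor found. Even numbers and n < 3 need none. */
int trial_division_steps(int n)
{
    int steps = 0;
    if (n < 3 || n % 2 == 0) {
        return 0;
    }
    for (int t = 3; t <= n / t; t += 2) {   /* t <= n / t  means  t*t <= n */
        steps++;
        if (n % t == 0) {
            break;
        }
    }
    return steps;
}

/* Assumed model: group_size consecutive candidates starting at first run in
   lockstep, so every item occupies its lane for as long as the slowest item.
   Returns the fraction of lane-iterations that do no useful work. */
double idle_fraction(int first, int group_size)
{
    assert(group_size > 0);
    long useful = 0;
    int slowest = 0;
    for (int k = 0; k < group_size; k++) {
        int s = trial_division_steps(first + k);
        useful += s;
        if (s > slowest) {
            slowest = s;
        }
    }
    if (slowest == 0) {
        return 0.0;   /* nothing entered the loop, nothing was wasted */
    }
    return 1.0 - (double)useful / ((double)slowest * (double)group_size);
}
```

Under this model, a group holding just 96 and 97 is already half idle: 96 is rejected immediately, while 97 needs four iterations. A CPU device that simply moves on to the next item after each one finishes does not pay that cost.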

I think the GPU would be much better suited to FFT-based prime calculations, or even a sieve algorithm.

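// Kernel name and parameter list are assumed; only the body was posted.
__kernel void find_primes(__global int *B)
{
    int i = get_global_id(0);
    // sqrt() takes a floating-point argument, so cast explicitly. The +1 guards
    // against the float result landing just below an exact integer root.
    int end = (int)sqrt((float)i) + 1;

    // 2 is prime; 0, 1 and all other even numbers are not.
    int is_prime = (i == 2) || (i > 2 && i % 2 != 0);

    // Test odd divisors only and stop at the first hit. The flag is kept in a
    // private variable so global memory is written once, not on every iteration.
    for (int t = 3; t <= end && is_prime; t += 2) {
        if (i % t == 0) {
            is_prime = 0;
        }
    }
    B[i] = is_prime; // non-zero means prime
}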
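
For comparison, here is what the sieve idea mentioned above looks like as a plain C sketch on the host. It is not an OpenCL port, and the function name `sieve` and the `unsigned char` flag array are my own choices for illustration. The point is that the inner loop does the same small amount of work for every multiple, which is the kind of uniform workload that tends to map more evenly onto a GPU than per-number trial division.

```c
#include <assert.h>
#include <string.h>

/* Sieve of Eratosthenes: sets is_prime[n] to 1 if n is prime and to 0
   otherwise, for every n in the range 0 <= n < limit. */
void sieve(unsigned char *is_prime, int limit)
{
    assert(is_prime != NULL && limit >= 0);
    if (limit == 0) {
        return;
    }
    memset(is_prime, 1, (size_t)limit);   /* start by assuming everything is prime */
    is_prime[0] = 0;
    if (limit > 1) {
        is_prime[1] = 0;
    }
    /* p <= (limit - 1) / p  is  p*p < limit  without the multiplication overflowing */
    for (int p = 2; p <= (limit - 1) / p; p++) {
        if (!is_prime[p]) {
            continue;                      /* p is composite, its multiples are already cleared */
        }
        for (int m = p * p; m < limit; m += p) {
            is_prime[m] = 0;               /* same cost for every multiple */
        }
    }
}
```

One possible way to move this to the device would be to let each work item clear a slice of the multiples of a given prime, but I have not tried that here.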

Upvotes: 1
