Reputation: 31
I recently implemented and tested OpenCL, using a struct to carry and update a C++ class object via a simple kernel function, and found to my dismay that the same function was actually faster when run in a plain for loop without the kernel.
Here is the kernel function:

__kernel void function_x_y_(__global myclass_* input, long n)
{
    int gid = get_global_id(0);
    if (gid < n)
        input[gid].valuez = input[gid].valuey * input[gid].valuex * 8736;
}
Here is the for loop:

for (int i = 0; i < 100; i++) {
    thisclass[i].function_x_y();
}
and the class function:

void function_x_y() {
    valuez = valuex * valuey;
}
I timed both processes:

cout << "Run function in serial\n";
startTime = clock();
for (int i = 0; i < 100; i++) {
    thisclass[i].function_x_y();
}
endTime = clock();
cout << "It took (serial) " << (endTime - startTime) / (CLOCKS_PER_SEC / 1000000) << " ms. " << endl;

cout << "Run function in parallel using struct to write to object\n";
init_ocl();
startTime = clock();
load_kernel_from_struct("function_x_y_", p_struct, 100); // Loads function and variables into OpenCL
endTime = clock();
cout << "It took (parallel) " << (endTime - startTime) / (CLOCKS_PER_SEC / 1000000) << " ms. " << endl;
With the output:
Run function in serial
It took (serial) 5 ms.
Run function in parallel using struct to write to object
It took (parallel) 159010 ms.
I am using cl-helper.c by Andreas Klöckner.
I don't understand this; it should be faster. Any help or advice is welcome.
Is there a more accurate way to benchmark this? Could the slowdown be due to the time it takes to initialise OpenCL, allocate memory, and transfer the data to the kernel?
There must be a way to make this run faster. Should I transfer and initialise everything before timing the function?
Thanks, Hbyte.
Upvotes: 1
Views: 248
Reputation: 20396
The fact that your original test uses only 100 elements ought to be a pretty major clue as to what's happening, not least because of how much the timings changed when you bumped the number of iterations up to 5 million.
One thing I would suggest, incidentally, is to measure only the submission of the work data to the GPU and the retrieval of the results, not the time spent compiling the kernel. That more accurately models the comparison with the host code, which has obviously been compiled beforehand.
And, of course, if you plan to take full advantage of GPGPU devices, you need to make sure the workload is actually large enough to benefit from the parallelism, even in spite of the setup overhead.
Upvotes: 1