Reputation: 38
I am performing a benchmark like show below
CHECK( context = clCreateContext(props, 1, &device, NULL, NULL, &_err); );
CHECK( queue = clCreateCommandQueue(context, device, 0, &_err); );
#define SYNC() clFinish(queue)
#define LAUNCH(glob, loc, kernel) OCL(clEnqueueNDRangeKernel(queue, kernel, 2,\
NULL, glob, loc,\
0, NULL, NULL))
/* Build program, set arguments over here */
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, plus_kernel);
}
SYNC();
STOP;
printf("Time taken (plus) : %lf\n", uSec / iter);
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (minus): %lf\n", uSec / iter);
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, plus_kernel);
LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (both) : %lf\n", uSec / iter);
The results look weird:
Time taken (plus) : 31.450000
Time taken (minus): 28.120000
Time taken (both) : 2256.380000
START
, and STOP
are just macros that start and stop a timer.
Here are the relevant macros.
I am not sure why queuing up is the kernels is slowing them down (and only on AMD GPUs)!
EDIT I am using Radeon 7970
EDIT Both kernels are operating on independent memory. Also here is the system information.
OS: Ubuntu 11.10
fglrxinfo:
display: :0 screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon HD 7900 Series
OpenGL version string: 4.2.11762 Compatibility Profile Context
Upvotes: 1
Views: 236
Reputation: 6178
I think the answer has to do with caching of data on newer GPUs (Specifically the Radeon 7970, which uses the Graphics Compute Next (GCN) architecture.
One of the advantages of this architecture is it's caching capabilities (somewhat close to CPU caching at this point). If you perform calls like this:
PLUS
PLUS
PLUS
....
Then the memory that is resident in the inner caches of the GPU. On the other hand if you make calls like this:
PLUS
MINUS
PLUS
MINUS
...
Where the two kernels have different memory objects associated with them, then the data is kicked out of the hardware devices on each CU, causing a need for them to be brought in from the very sluggish global memory.
Two easy ways to test if this is the case:
Run only Pluses with varying numbers of iterations. As the number of iterations increases, the average time will go down because the cost of the first run (which brings the data in) is amortized. Also, you should notice that all calls after the first should be relatively equal.
Make the Plus and Minus kernels run on the same memory objects. If the reason for the slowdown is because of the caching of memory objects, then the overall run time should be the average of the individual running times of PLUS and MINUS (depending perhaps on experiment 1).
Let me know if you find out if this is actually the case!
Upvotes: 1