Reputation: 6097
I am trying to run a program on my GPU. To start with an easy example, I modified the first sample on http://www.jocl.org/samples/samples.html to run the following little script: I run n simultaneous "threads" (what's the correct name for the GPU equivalent of a thread?), each of which performs 20000000/n independent tanh() computations. You can see my code here: http://pastebin.com/DY2pdJzL
The speed is far from what I expected:
So after n=6 (be it n=8, n=20, n=100, n=1000 or n=100000), there is no performance increase, which means only 6 of these are computed in parallel. However, according to the specifications of my card there should be 80 cores: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2
It is not a matter of overhead, since increasing or decreasing the 20000000 only scales all the execution times by a linear factor.
I have installed the AMD APP SDK and drivers that support OpenCL: see http://dl.dropbox.com/u/3060536/prtscr.png and http://dl.dropbox.com/u/3060536/prtsrc2.png for details (or at least I conclude from these that OpenCL is running correctly).
So I'm a bit clueless now about where to search for an answer. Why can JOCL only do 6 parallel executions on my ATI Radeon HD 5450?
Upvotes: 0
Views: 405
Reputation: 2779
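You are hard-coding the local work size to 1, so every work-group contains a single work item. Use a larger size, or pass null as the local work size and let the driver choose one for you.
Also, your kernel is not written in the usual OpenCL style. You should take out the for loop and let the driver handle the iterating for you, with each work item doing one computation.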
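Here is a rough sketch of both suggestions. The pastebin code is not reproduced here, so this is an assumption about its structure, not your actual program. The kernel name `tanhKernel`, its arguments, the helper `roundUp` and the work-group size of 64 are hypothetical choices for illustration. The kernel computes exactly one tanh() per work item, and the Java side only works out the sizes. The real `clEnqueueNDRangeKernel` call is shown as a comment because it needs a live OpenCL context.

```java
class TanhKernelSketch {
    // OpenCL C source: one work item per element, no inner for loop.
    // Kernel name and argument layout are assumptions for this sketch.
    static final String KERNEL_SOURCE =
        "__kernel void tanhKernel(__global const float *in,\n" +
        "                         __global float *out,\n" +
        "                         const int count)\n" +
        "{\n" +
        "    int gid = get_global_id(0);\n" +
        "    if (gid < count) {   // guard against padded work items\n" +
        "        out[gid] = tanh(in[gid]);\n" +
        "    }\n" +
        "}\n";

    // Round count up to the next multiple of groupSize. Needed only when an
    // explicit local work size is passed, since the global size must then be
    // evenly divisible by it.
    static long roundUp(long count, long groupSize) {
        return ((count + groupSize - 1) / groupSize) * groupSize;
    }

    public static void main(String[] args) {
        long count = 20_000_000L; // total number of tanh() computations
        long localSize = 64;      // assumed work-group size, not a tuned value
        long globalSize = roundUp(count, localSize);

        System.out.println("global work size: " + globalSize);
        System.out.println("work-groups:      " + (globalSize / localSize));

        // With JOCL the launch would then look roughly like this, where
        // commandQueue and kernel come from the sample's setup code:
        //   clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
        //       new long[]{ globalSize }, new long[]{ localSize },
        //       0, null, null);
        // Passing null in place of new long[]{ localSize } lets the driver
        // choose the work-group size itself.
    }
}
```

With this layout the number of work items is the number of computations, not n, so the driver is free to spread them over the hardware however it sees fit.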
Upvotes: 1