Reputation: 87
Hi all. I have a single GPU, an Nvidia GTX 750. I tested copying data from CPU to GPU with clEnqueueWriteBuffer, first from a single thread and then from multiple threads. The multithreaded version turned out to be slower. In the multithreaded case, each thread has its own kernel, command queue, and context, all created from the same device. So my question is: does the clEnqueueWriteBuffer call take some lock on the device? How can I reduce this effect?
Upvotes: 1
Views: 1182
Reputation: 11920
Edit: if individual workloads are too light to saturate the hardware, multiple concurrent command queues can achieve better total bandwidth.
Like OpenGL, OpenCL needs multiple buffers batched into a single one to go faster; even passing a single kernel parameter instead of multiple parameters is faster. This is because there is operating-system/API overhead per operation, so moving bigger but fewer chunks is better.
You could have bought two graphics cards that, combined, are equivalent to a GTX 750, to use two PCI-e buses in parallel (if your mainboard can provide two separate 16x lanes).
PCI-e lanes are bidirectional, so you can try to parallelize writes and reads, or visualization and computation, or compute and writes, or compute and reads, or compute+write+read (of course, only if they are not dependent on each other, as in figure 1-a) — if your algorithm has such opportunities and your graphics card supports it.
Once I tried divide and conquer on a big array, sending each part to the GPU separately; it took seconds. Now I compute with just a single call for the writes and a single call for the compute, and it takes only milliseconds.
Figure 1-a:
write iteration --- compute iteration --- read iteration --- parallels
      1                    -                    -                1
      2                    1                    -                2
      3                    2                    1                3
      4                    3                    2                3
      5                    4                    3                3
      6                    5                    4                3
This assumes there is no dependency between iterations. If there is a dependency, then:
figure 1-b:
write iteration --- compute iteration --- read iteration --- parallels
half of 1           -                     -                  1
other half of 1     half of 1             -                  2
half of 2           other half of 1       half of 1          3
other half of 2     half of 2             other half of 1    3
half of 3           other half of 2       half of 2          3
other half of 3     half of 3             other half of 2    3
If you need parallelization between batches of images with non-constant sizes:
cpu to gpu ------ gpu to gpu ----- compute ----- gpu to cpu
1,2,3,4,5         -                -             -
-                 1,2,3            -             -
-                 4,5              1,2,3         -
-                 -                4,5           1,2,3
6,7,8,9           -                -             4,5
10,11,12          6,7,8            -             -
13,14             9,10,11          6,7           -
15,16,17,18       12,13,14         8,9,10        6
Upvotes: 2