Reputation: 1681
Could someone tell me why arrayfun is much faster than a for loop on the GPU? (This is not the case on the CPU, where a for loop is actually faster.)
Arrayfun:
x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(@(x) x^2, x);
And the equivalent for loop:
for i=1:size(x,1)*size(x,2)*size(x,3)
z(i)=x(i).^2;
end
Is it because a for loop is not multithreaded on the GPU? Thanks.
Upvotes: 3
Views: 4659
Reputation: 1
These are the times I got for the same code. arrayfun on the CPU takes approximately 17 sec, which is much slower, but on the GPU arrayfun is much faster.
parfor time = 0.4379
for time = 0.7237
GPU arrayfun time = 0.1685
Upvotes: 0
Reputation: 7975
I don't think your loops are equivalent. It seems you're squaring every element in an array with your CPU implementation, but performing some sort of count for arrayfun.
Regardless, I think the explanation you're looking for is as follows:
When run on the GPU, your code can be functionally decomposed, into each array cell in this case, and each cell squared separately. This is okay because for a given i, the value of [cell_i]^2 doesn't depend on any of the values in other cells. What most likely happens is that the array gets decomposed into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The results are copied back into the original array layout, and that array is returned to count.
Now don't worry: if you're counting things, as it seems arrayfun is actually doing, a similar thing is happening. The algorithm most likely partitions the array into similar buffers and, instead of squaring each cell, adds the values together. You can think of the result of this first step as a smaller array, to which the same process can be applied recursively to sum the new partial sums.
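To make the recursive summing described above concrete, here is a minimal sketch of a pairwise (tree) reduction. It runs serially on the CPU purely for illustration; the variable names are made up, and this is not a claim about how MATLAB or any GPU library actually implements reductions internally.

```matlab
% Sketch of a tree (pairwise) reduction, run serially on the CPU for clarity.
% Variable names are illustrative only.
v = 1:10;                      % example data
partial = v(:).';              % work on a row vector

while numel(partial) > 1
    if mod(numel(partial), 2) == 1
        partial(end+1) = 0;    % pad odd lengths with the additive identity
    end
    % Each pair is independent, so on a GPU every addition at this level
    % could be handled by a separate thread.
    partial = partial(1:2:end) + partial(2:2:end);
end

total = partial;               % same value as sum(v)
disp(total)
```

Each pass halves the number of partial sums, so n values need roughly log2(n) passes rather than n sequential additions.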
Upvotes: 3
Reputation: 25140
As per the reference page here http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html, "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for loop version, each operation is executed separately on the GPU, and this incurs overhead, whereas the arrayfun version is one single GPU kernel invocation.
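One way to see the per-operation overhead is to time both versions. The following is a rough sketch, assuming Parallel Computing Toolbox and a supported GPU; it uses the newer gpuArray spelling instead of parallel.gpu.GPUArray, and it loops over only the first n elements (an arbitrary choice) so that the loop finishes quickly.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU are available.
x   = gpuArray(rand(512, 512, 64));
dev = gpuDevice;

% arrayfun: the function is compiled once and applied to every element.
tic;
y = arrayfun(@(v) v^2, x);
wait(dev);                     % make sure the GPU has finished before stopping the timer
tArrayfun = toc;

% Explicit loop: each indexing, squaring, and assignment is issued separately.
n = 1000;                      % arbitrary subset size, for illustration only
z = zeros(1, n, 'gpuArray');
tic;
for i = 1:n
    z(i) = x(i)^2;
end
wait(dev);
tLoop = toc;

fprintf('arrayfun, %d elements: %.4f s\n', numel(x), tArrayfun);
fprintf('for loop, %d elements: %.4f s\n', n, tLoop);
```

Actual timings depend on the hardware and MATLAB release, and the first GPU call may include one-time start-up cost, so it is worth running the script twice before reading too much into the numbers.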
Upvotes: 2