Maiss

Reputation: 1681

Why is Arrayfun much faster than a for-loop when using GPU?

Could someone explain why arrayfun is much faster than a for loop on the GPU? (On the CPU it is the other way around: the for loop is faster.)

Arrayfun:

x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(@(x) x^2, x);

And the equivalent for loop:

for i=1:size(x,1)*size(x,2)*size(x,3)
  z(i)=x(i).^2;        
end

Is it because a for loop is not multithreaded on the GPU? Thanks.

Upvotes: 3

Views: 4659

Answers (3)

Lasin ART

Reputation: 1

These are the timings I got for the same code. arrayfun on the CPU takes approx. 17 sec, which is much higher, but on the GPU arrayfun is much faster.

parfor time = 0.4379
for time = 0.7237
gpu arrayfun time = 0.1685

Upvotes: 0

dcow

Reputation: 7975

I don't think your loops are equivalent. It seems you're squaring every element in an array with your CPU implementation, but performing some sort of count for arrayfun.

Regardless, I think the explanation you're looking for is as follows:

When run on the GPU, your code can be functionally decomposed (into individual array cells in this case), and each cell can be squared separately. This is okay because, for a given i, the value of [cell_i]^2 doesn't depend on the values in any other cell. What most likely happens is that the array gets decomposed into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The results are copied back to the original array, which is then returned to count.
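To make the decomposition idea concrete, here is a minimal CPU-only sketch. It is not what MATLAB or the GPU actually does internally: S and the chunk boundaries in edges are hypothetical, and a real GPU schedules work quite differently. The point is only that each chunk can be squared without looking at any other chunk, so the chunks could all run at the same time.

```matlab
% Illustrative sketch only: split an array into S independent chunks
% and square each chunk on its own. S is a hypothetical worker count.
x = rand(1, 1000);
S = 8;

% Chunk boundaries: chunk s covers indices edges(s)+1 .. edges(s+1).
edges = round(linspace(0, numel(x), S + 1));

z = zeros(size(x));
for s = 1:S
    idx = edges(s) + 1 : edges(s + 1);
    % No chunk reads or writes another chunk's data, so the iterations
    % of this loop are independent and could run in parallel.
    z(idx) = x(idx).^2;
end
```

Because the chunks never overlap, the order in which they are processed does not matter, which is what makes this kind of element-wise operation a good fit for a GPU.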

Now don't worry: if you're counting things, as it seems arrayfun is actually doing, a similar thing is happening. The algorithm most likely partitions the array into similar buffers and, instead of squaring each cell, adds the values together. You can think of the result of this first step as a smaller array, to which the same process can be applied recursively to add up the new sums.
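The recursive partial-sum idea can be sketched on the CPU as a pairwise reduction. This is a generic pattern shown under my own assumptions (pairs of neighbours, zero padding for odd lengths), not a claim about what arrayfun or the GPU does internally.

```matlab
% Illustrative sketch only: sum a vector by repeatedly adding
% neighbouring pairs, so each pass halves the number of partial sums.
v = rand(1, 1000);
partial = v;

while numel(partial) > 1
    % Pad with a zero when the length is odd so every element has a partner.
    if mod(numel(partial), 2) == 1
        partial(end + 1) = 0;
    end
    % All pair sums within one pass are independent of each other,
    % so a parallel device could compute them at the same time.
    partial = partial(1:2:end) + partial(2:2:end);
end

total = partial;   % one value left: the sum of all elements of v
```

Each pass needs only the results of the previous pass, so a vector of n elements is reduced in roughly log2(n) passes instead of n sequential additions.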

Upvotes: 3

Edric

Reputation: 25140

As per the reference page here http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html, "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for loop version, each operation is executed separately on the GPU, and this incurs overhead, whereas the arrayfun version is a single GPU kernel invocation.
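One way to see the per-operation overhead for yourself is to time both versions. The following is a rough sketch, assuming the Parallel Computing Toolbox, a supported GPU, and a release that has gpuArray and gputimeit. The sample size n is an arbitrary choice of mine: the loop is run on a small sample only, because looping over all elements of the array on the GPU would take a very long time.

```matlab
% Rough timing sketch; requires Parallel Computing Toolbox and a GPU.
x = gpuArray(rand(512, 512, 64));

% arrayfun: the function is compiled and applied to every element
% in one go. gputimeit waits for the GPU to finish before timing.
sq = @(v) v.^2;
tArrayfun = gputimeit(@() arrayfun(sq, x));

% Explicit loop over a small sample (n is a hypothetical sample size).
% Each indexing, power and assignment is issued to the GPU separately.
n = 2000;
z = zeros(1, n, 'gpuArray');
dev = gpuDevice;
tic;
for i = 1:n
    z(i) = x(i).^2;
end
wait(dev);          % make sure all queued GPU work has finished
tLoop = toc;

fprintf('arrayfun, %d elements: %.4f s\n', numel(x), tArrayfun);
fprintf('for loop, %d elements: %.4f s (%.2e s per element)\n', ...
        n, tLoop, tLoop / n);
```

Comparing the per-element cost of the loop with tArrayfun / numel(x) gives a feel for how much of the loop's time is launch overhead and not arithmetic.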

Upvotes: 2
