Reputation: 21
I am puzzled by the large performance difference between an FFT and a simple addition on a GPU in MATLAB. I would expect the FFT to be slower on the GPU than a simple addition, but my timings show the opposite. Why is that? Any suggestions?
a=rand(2.^20,1);
a=gpuArray(a);
b=gpuArray(0);
c=gpuArray(1);
tic % should take a long time
for k=1:1000
fft(a);
end
toc % Elapsed time is 0.085893 seconds.
tic % should be fast, but isn't
for k=1:1000
b=b+c;
end
toc % Elapsed time is 1.430682 seconds.
It is also interesting to note that the computation time for the addition (second loop) decreases if I reduce the length of the vector a.
EDIT
If I swap the order of the two loops, i.e. if the addition is done first, the addition takes 0.2 seconds instead of 1.4 seconds. The FFT time stays the same.
Upvotes: 2
Views: 981
Reputation: 4557
I don't have MATLAB 2012b with a GPU to hand to check this, but I think you are missing a wait() command. In 2012a, MATLAB introduced asynchronous GPU calculations, so when you send something to the GPU, MATLAB doesn't wait until it's finished before moving on in the code. Try this:
mygpu=gpuDevice(1);
a=rand(2.^20,1);
a=gpuArray(a);
b=gpuArray(0);
c=gpuArray(1);
tic % should take a long time
for k=1:1000
fft(a);
end
wait(mygpu); %Wait until the GPU has finished calculating before moving on
toc
tic % should be fast
for k=1:1000
b=b+c;
end
wait(mygpu); %Wait until the GPU has finished calculating before moving on
toc
The computation time of the addition should no longer depend on when it's carried out. Would you mind checking and getting back to me, please?
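As an alternative to calling wait() by hand, newer releases of the Parallel Computing Toolbox provide gputimeit, which synchronizes with the GPU and repeats the measurement for you. A minimal sketch, assuming a release that has gputimeit (it is not available in 2012b); the variable names tFft and tAdd are only for illustration:

```matlab
% Sketch only: requires a supported GPU and a release that ships gputimeit.
a = gpuArray(rand(2^20, 1));
b = gpuArray(0);
c = gpuArray(1);

% gputimeit waits for the GPU to finish before stopping the clock,
% so asynchronous execution cannot leak into the next measurement.
tFft = gputimeit(@() fft(a));   % time of one FFT of 2^20 points
tAdd = gputimeit(@() b + c);    % time of one scalar addition

fprintf('fft:      %g s per call\n', tFft);
fprintf('addition: %g s per call\n', tAdd);
```

Note that this reports the time of a single call rather than of 1000 iterations, so the numbers are not directly comparable with the tic/toc totals above.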
Upvotes: 0
Reputation: 10708
I'm guessing that MATLAB isn't actually running the FFT because the output is not used anywhere. Also, in your simple addition loop, each iteration depends on the result of the previous one, so the iterations have to run serially.
I don't know why the order of the loops matters. Maybe it has something to do with cleaning up the GPU memory after the first loop. You could try calling pause(1) between the loops to let your computer get back to an idle state before the second loop. That may make your timing more consistent.
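To test both guesses at once, here is a rough sketch that keeps the FFT result in a variable (y, a name I made up) so the output is actually used, and pauses between the loops. I have also added wait() calls on the assumption that GPU calls return before the work is done; without them the pause alone may hide the effect rather than measure it:

```matlab
% Sketch only: requires a supported GPU.
gpu = gpuDevice;                 % currently selected GPU device
a = gpuArray(rand(2^20, 1));
b = gpuArray(0);
c = gpuArray(1);

tic
for k = 1:1000
    y = fft(a);                  % keep the result so the FFT output is used
end
wait(gpu);                       % block until all queued GPU work is done
toc

pause(1);                        % let the machine return to an idle state

tic
for k = 1:1000
    b = b + c;                   % each iteration depends on the previous one
end
wait(gpu);
toc
```

If the addition time no longer changes with the loop order, the difference was most likely leftover FFT work being counted in the second timing.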
Upvotes: 1