Reputation: 192
I have GPU accelerated MATLAB code that spends 80%-90% of its time computing
sum(a.*exp(b.*c),1)
where
size(a) = [n 1]
size(b) = [n 1]
size(c) = [1 m]
n can be chosen to be arbitrarily large (within memory constraints)
5000 < m < 20000
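To make the shapes concrete, here is a minimal CPU-only sketch. The values of n and m are illustrative only and far smaller than the real problem. It relies on implicit expansion (available since R2016b), which turns b.*c into a full n-by-m intermediate. In double precision that intermediate takes 8*n*m bytes, so about 2.4 GB at n = 60000, m = 5000, which is presumably where the memory constraint on n comes from.

```matlab
% Small CPU-only sketch of the shapes involved (n and m are illustrative values).
n = 4; m = 3;
a = linspace(0.1, 1, n).';   % n-by-1
b = linspace(0.2, 2, n).';   % n-by-1
c = linspace(0, 1, m);       % 1-by-m

E = exp(b.*c);               % implicit expansion: (n-by-1) .* (1-by-m) gives n-by-m
y = sum(a.*E, 1);            % reduce over the n dimension, leaving 1-by-m

disp(size(E))                % size is [4 3]
disp(size(y))                % size is [1 3]
```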
I'd like to speed this up by more than just using gpuArrays (approx 17x for double precision).
Using MATLAB R2018b and an NVIDIA P100 GPU, I ran the following script to find the optimal value of n. It shows a 17x speedup over the CPU (dual-socket Intel Xeon E5-2650v2) in double precision. Can I improve on this with something more advanced, such as GPU Coder, or shared memory or texture memory as described in the following? https://uk.mathworks.com/help/parallel-computing/examples/accessing-advanced-cuda-features-using-mex.html
%% Optimisation MWE
nVec = 1000:1000:60000; % Vector of candidate n values
m = 5000;
f1 = figure(1);
ax(1) = subplot(3,1,1);
ax(2) = subplot(3,1,2);
ax(3) = subplot(3,1,3);
% Preallocate time outputs
t = nan(length(nVec),3);
speedupGPU = nan(length(nVec),2);
% Loop over candidate n values
for n = 1:length(nVec)
    %% CPU code
    a = rand(nVec(n),1);
    b = rand(nVec(n),1);
    c = rand(1,m);
    f1 = @() sum(a.*exp(b.*c),1);
    t(n,1) = timeit(f1,1);
    %% GPU code (double precision)
    a = gpuArray(a);
    b = gpuArray(b);
    c = gpuArray(c);
    f2 = @() sum(a.*exp(b.*c),1);
    t(n,2) = gputimeit(f2);
    %% GPU code (single precision)
    a = single(a);
    b = single(b);
    c = single(c);
    f3 = @() sum(a.*exp(b.*c),1);
    t(n,3) = gputimeit(f3);
    %% Calculate speedup
    speedupGPU(n,1) = t(n,1)/t(n,2);
    speedupGPU(n,2) = t(n,1)/t(n,3);
    %% Plot
    plot(ax(1),nVec,t,'.-')           % Plot compute time
    plot(ax(2),nVec,t./nVec(:),'.-')  % Plot normalised compute time
    plot(ax(3),nVec,speedupGPU,'.-')  % Plot speedup
    %% Label plots
    xlabel(ax(1),'n')
    ylabel(ax(1),'Time')
    legend(ax(1),'CPU','GPU double','GPU single')
    xlabel(ax(2),'n')
    ylabel(ax(2),'Normalised Time')
    legend(ax(2),'CPU','GPU double','GPU single')
    xlabel(ax(3),'n')
    ylabel(ax(3),'Speedup')
    legend(ax(3),'CPU/GPU double','CPU/GPU single')
    drawnow
end
This results in the following figure (top: Execution time with increasing n (smaller is better), middle: Execution time normalised by n (smaller is better), bottom: speedup relative to CPU (larger is better)):
Upvotes: 4
Views: 213
Reputation: 24169
I realize it might not give you the speedup you're looking for, but one way to make this code more performant is to get rid of the sum by using matrix multiplication:
sum(a.*exp(b.*c),1) --> a.'*exp(b.*c)
On my system this increased the speedup from ~10x to ~15x.
I should also mention that for the lowest value of n, I got a ~20x speedup by also replacing the array multiplication (.*) with matrix multiplication (*): a.'*exp(b.*c) --> a.'*exp(b*c).
Tested on R2019b, Win10, GTX660.
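As a sanity check on the rewrites above, here is a minimal CPU-only sketch confirming that the three forms give the same result up to rounding. The sizes and the relErr helper are illustrative and not part of the original benchmark. The same expressions should also accept gpuArray inputs, in which case they can be timed with gputimeit as in the question.

```matlab
% CPU-only check that the three formulations agree (small illustrative sizes).
n = 200; m = 50;
a = rand(n,1);
b = rand(n,1);
c = rand(1,m);

y1 = sum(a.*exp(b.*c), 1);   % original: elementwise product, then column sum
y2 = a.' * exp(b.*c);        % sum replaced by a (1-by-n)*(n-by-m) product
y3 = a.' * exp(b*c);         % b*c is the outer product, the same n-by-m matrix as b.*c

% Summation order differs between the forms, so compare with a tolerance
% rather than isequal.
relErr = @(u,v) max(abs(u - v)) / max(abs(v));
fprintf('sum vs a.''*: %.2e\n', relErr(y2, y1));
fprintf('.* vs *:     %.2e\n', relErr(y3, y1));
```

Note that a.' is the non-conjugate transpose. For real inputs a' gives the same result, but for complex a it would conjugate and no longer match the original sum.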
Upvotes: 4