B. Thomas

Reputation: 192

Efficient way to calculate sum(a.*exp(b.*c),1) using MATLAB's GPU functionality

I have GPU-accelerated MATLAB code that spends 80%-90% of its time computing

sum(a.*exp(b.*c),1)

where

size(a) = [n 1]
size(b) = [n 1]
size(c) = [1 m]

n can be chosen to be arbitrarily large (within memory constraints)

5000 < m < 20000

I'd like a larger speedup than the roughly 17x (double precision) that I get from simply using gpuArrays.

Benchmarking

Using MATLAB R2018b and an NVIDIA P100 GPU, I ran the following script to find the optimal size of n. It shows a 17x speedup over the CPU (dual-socket Intel Xeon E5-2650v2) in double precision. Can I improve on this with something more advanced, such as GPU Coder, or shared memory or texture memory as described here? https://uk.mathworks.com/help/parallel-computing/examples/accessing-advanced-cuda-features-using-mex.html

%% Optimisation MWE

nVec = 1000:1000:60000; % Vector of candidate n values
m = 5000;

fig = figure(1); % Named fig so the function handle f1 below does not overwrite the figure handle
ax(1) = subplot(3,1,1);
ax(2) = subplot(3,1,2);
ax(3) = subplot(3,1,3);

% Preallocate time outputs
t = nan(length(nVec),3);
speedupGPU = nan(length(nVec),2);

% Loop over candidate n values
for n = 1:length(nVec)

    %% CPU code
    a = rand(nVec(n),1);
    b = rand(nVec(n),1);
    c = rand(1,m);

    f1 = @() sum(a.*exp(b.*c),1);

    t(n,1) = timeit(f1,1);

    %% GPU code (double precision)
    a = gpuArray(a);
    b = gpuArray(b);
    c = gpuArray(c);

    f2 = @() sum(a.*exp(b.*c),1);

    t(n,2) = gputimeit(f2);

    %% GPU code (single precision)
    a = single(a);
    b = single(b);
    c = single(c);

    f3 = @() sum(a.*exp(b.*c),1);

    t(n,3) = gputimeit(f3);

    %% Calculate speedup
    speedupGPU(n,1) = t(n,1)/t(n,2);
    speedupGPU(n,2) = t(n,1)/t(n,3);

    %% Plot
    plot(ax(1),nVec,t,'.-')             % Plot compute time
    plot(ax(2),nVec,t./nVec(:),'.-')    % Plot normalised compute time
    plot(ax(3),nVec,speedupGPU,'.-')    % Plot Speedup

    %% Label plots
    xlabel(ax(1),'n')
    ylabel(ax(1),'Time')
    legend(ax(1),'CPU','GPU double','GPU single')

    xlabel(ax(2),'n')
    ylabel(ax(2),'Normalised Time')
    legend(ax(2),'CPU','GPU double','GPU single')

    xlabel(ax(3),'n')
    ylabel(ax(3),'Speedup')
    legend(ax(3),'CPU/GPU double','CPU/GPU single')

    drawnow

end

This results in the following figure (top: Execution time with increasing n (smaller is better), middle: Execution time normalised by n (smaller is better), bottom: speedup relative to CPU (larger is better)):

[Figure: execution time, normalised execution time, and speedup over various n]

Upvotes: 4

Views: 213

Answers (1)

Dev-iL

Reputation: 24169

I realize it might not give you the speedup you're looking for, but one way to make this code more performant is to get rid of the sum by using matrix multiplication:

sum(a.*exp(b.*c),1) --> a.'*exp(b.*c)

On my system this increased the GPU speedup over the CPU from ~10x to ~15x.

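I should also mention that, for the lowest value of n, I got a ~20x speedup by also replacing the array multiplication (.*) inside exp with matrix multiplication (*): a.'*exp(b.*c) --> a.'*exp(b*c). This is valid because b is n-by-1 and c is 1-by-m, so b*c is the outer product and equals the implicitly expanded b.*c.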

Tested on R2019b, Win10, GTX660.
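
Before swapping the expression into production code, it may be worth confirming that the three forms agree numerically. The following is a minimal CPU-only sketch, not part of the benchmark above. The sizes n = 2000 and m = 500 and the variable names ySum, yMat, yMat2, relErr1 and relErr2 are chosen purely for illustration, and it assumes R2016b or later for implicit expansion in b.*c.

```matlab
% Illustrative check that the three formulations give the same 1-by-m result.
% Sizes are arbitrary assumptions, kept small so this runs quickly on the CPU.
n = 2000;
m = 500;

a = rand(n,1);
b = rand(n,1);
c = rand(1,m);

ySum  = sum(a.*exp(b.*c),1);   % original: elementwise product, then column sum
yMat  = a.'*exp(b.*c);         % column sum expressed as a (1-by-n)*(n-by-m) product
yMat2 = a.'*exp(b*c);          % outer product b*c instead of implicit expansion

% The results differ only by floating-point rounding (different summation order).
relErr1 = max(abs(ySum - yMat)  ./ abs(ySum));
relErr2 = max(abs(ySum - yMat2) ./ abs(ySum));

fprintf('max relative difference, a.''*exp(b.*c): %g\n', relErr1);
fprintf('max relative difference, a.''*exp(b*c):  %g\n', relErr2);
```

Note that a.' (non-conjugate transpose) is used rather than a', so the identity would also hold for complex a. To time the variants on a GPU, the same expressions could be wrapped in function handles and passed to gputimeit after converting a, b and c with gpuArray, as in the question's script.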

Upvotes: 4
