Reputation: 61

MATLAB Speed Optimisation

Can anyone help? I am a fairly experienced Matlab user but am having trouble speeding up the code below.

The fastest time I have been able to achieve for one run through all three loops, using 12 cores, is ~200s. The actual function will be called ~720 times and at this rate will take over 40hrs to execute. According to the Matlab profiler, the majority of cpu time is spent in the exponential function call. I've managed to speed this up quite substantially using a gpuArray and then running the exp call on a Quadro 4000 graphics card however this then prevents the parfor loop from being used, since the workstation has only one graphics card, which obliterates any gains. Can anyone help, or is this code close to the optimum that can be achieved using Matlab? I have written a very crude c++ implementation with openMP but achieved little gain.

Many thanks in advance

function SPEEDtest_CPU

% Variable setup:
% - For testing I'll use random variables. These will actually be fed into 
%   the function for the real version of this code.
sy    = 320;
sx    = 100;
sz    = 32;
A     = complex(rand(sy,sx,sz),rand(sy,sx,sz));
B     = complex(rand(sy,sx,sz),rand(sy,sx,sz));
C     = rand(sy,sx);
D     = rand(sy*sx,1);
F     = zeros(sy,sx,sz);
x     = rand(sy*sx,1);  
y     = rand(sy*sx,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;


% MAIN LOOPS 
%  - In the real code this set of three loops will be called ~720 times!
%  - Using 12 cores, the fastest I have managed is ~200 seconds for one
%    call of this function.
tic
for z = 1 : sz
    A_slice = A(:,:,z);
    A_slice = A_slice(:);
    parfor cx = 1 : sx       
        for cy = 1 : sy       
            E = ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );                                                          

            F(cy,cx,z) = (B(cy,cx,z) .* exp(-1i .* E))' * A_slice; 
        end       
    end   
end
toc

end

Upvotes: 6

Answers (8)

GLK

Reputation: 1

This line: ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );

is some type of convolution, is it not? Circular convolution is much faster in the frequency domain, and the conversion to/from frequency domain is optimized using the FTT.

Upvotes: 0

Yair Altman

Reputation: 5722

In addition to the other good advise given here by others, the multiplication by A_slice is independent of the cx,cy loops and can be taken outside them, multiplying F once both loops have finished.

Similarly, the conjugation of B*exp(...) can also be done en-bulk outside the cx,cy loop, before multiplication by A_slice.

Upvotes: 0

Luis Mendo

Reputation: 112759

If your data are real (not complex), as in your example, you can save time replacing

(B(cy,cx,z) .* exp(-1i .* E))'

(B(cy,cx,z) .* (cos(E)+1i*sin(E))).'

Specifically, on my machine (cos(x)+1i*sin(x)).' takes 19% less time than exp(-1i .* x)'.

If A and B are complex: E is still real, so you can precompute Bconj = conj(B) outside the loops (this takes about 10 ms with your data size, and it's done only once) and then replace

(B(cy,cx,z) .* exp(-1i .* E))'

(Bconj(cy,cx,z) .* (cos(E)+1i*sin(E))).'

to obtain a similar gain.

Upvotes: 2

StrongBad

Reputation: 879

You can move x .* x_ind(cx) out of the innermost loop. I don't have a GPU handy to test the timings, but you could split the code into three sections to allow you to use the GPU and parfor

for z = 1 : sz
    E = zeros(sy*sx,sx,sy);
    A_slice = A(:,:,z);
    A_slice = A_slice(:);
    parfor cx = 1 : sx
        temp = ( x .* x_ind(cx) );       
        for cy = 1 : sy       
            E(:, cx, cy) = temp + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );                                                          
        end
    end
    temp = zeros(zeros(sy*sx,sx,sy));
    for cx = 1 : sx
        for cy = 1 : sy       
             % Ideally use your GPU magic here
             temp(:, cx, cy) = exp(-1i .* E(:, cx, cy)));
        end
    end
    parfor cx = 1 : sx
        for cy = 1 : sy       
            F(cy,cx,z) = (B(cy,cx,z) .* temp(:, cx, cy)' * A_slice; 
        end       
    end   
end

Upvotes: 1

bdecaf

Reputation: 4732

Not sure if it helps much with speed - but as E is basically a sum maybe you can use that exp (i cx(A+1)x) = exp(i cx(A) x) * exp(i x) and exp(i x) can be calculated beforehand.

That way you wouldn't have to evaluate exp each iteration - but just have to multiplicate, which should be faster.

Upvotes: 0

Dennis Jaheruddin

Reputation: 21563

To allow for proper paralellization you need to make sure the loops are completely independant, hence check whether not assigning to E in each run helps.

Furthermore try to vectorize as much as possible, one simple example could be: y.*y_ind(cy)

If you just create the proper index for all values at once, you can take this out of the lowest loop.

Upvotes: 0

Luke

Reputation: 98

There are two main ways of speeding up MATLAB code; preallocation and vectorisation.

You have preallocated well but there is no vectorisation. In order to best learn how to do this you need to have a good grasp of linear algebra and the use of repmat to expand vectors into multiple dimensions.

Vectorisation can result in multiple orders of magnitude speedup and will use the cores optimally (provided the flag is up).

What is the mathematical expression you are calculating and I may be able to lend a hand?

Upvotes: 1

Mikhail

Reputation: 8038

Some things to think about:

Have you considered using singles?

Can you vectorize the cx, cy portion so that they represent array operations?

Consider changing the floating point rounding or signalling modes.

Upvotes: 3

MATLAB Speed Optimisation

Answers (8)

Related Questions