Reputation: 61
Can anyone help? I am a fairly experienced Matlab user but am having trouble speeding up the code below.
The fastest time I have been able to achieve for one run through all three loops, using 12 cores, is ~200s. The actual function will be called ~720 times and at this rate will take over 40hrs to execute. According to the Matlab profiler, the majority of cpu time is spent in the exponential function call. I've managed to speed this up quite substantially using a gpuArray and then running the exp call on a Quadro 4000 graphics card however this then prevents the parfor loop from being used, since the workstation has only one graphics card, which obliterates any gains. Can anyone help, or is this code close to the optimum that can be achieved using Matlab? I have written a very crude c++ implementation with openMP but achieved little gain.
Many thanks in advance
function SPEEDtest_CPU
% Variable setup:
% - For testing I'll use random variables. These will actually be fed into
% the function for the real version of this code.
sy = 320;
sx = 100;
sz = 32;
A = complex(rand(sy,sx,sz),rand(sy,sx,sz));
B = complex(rand(sy,sx,sz),rand(sy,sx,sz));
C = rand(sy,sx);
D = rand(sy*sx,1);
F = zeros(sy,sx,sz);
x = rand(sy*sx,1);
y = rand(sy*sx,1);
x_ind = (1:sx) - (sx / 2) - 1;
y_ind = (1:sy) - (sy / 2) - 1;
% MAIN LOOPS
% - In the real code this set of three loops will be called ~720 times!
% - Using 12 cores, the fastest I have managed is ~200 seconds for one
% call of this function.
tic
for z = 1 : sz
A_slice = A(:,:,z);
A_slice = A_slice(:);
parfor cx = 1 : sx
for cy = 1 : sy
E = ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
F(cy,cx,z) = (B(cy,cx,z) .* exp(-1i .* E))' * A_slice;
end
end
end
toc
end
Upvotes: 6
Views: 799
Reputation: 1
This line: ( x .* x_ind(cx) ) + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
is some type of convolution, is it not? Circular convolution is much faster in the frequency domain, and the conversion to/from frequency domain is optimized using the FTT.
Upvotes: 0
Reputation: 5722
In addition to the other good advise given here by others, the multiplication by A_slice
is independent of the cx,cy
loops and can be taken outside them, multiplying F
once both loops have finished.
Similarly, the conjugation of B*exp(...)
can also be done en-bulk outside the cx,cy
loop, before multiplication by A_slice
.
Upvotes: 0
Reputation: 112759
If your data are real (not complex), as in your example, you can save time replacing
(B(cy,cx,z) .* exp(-1i .* E))'
by
(B(cy,cx,z) .* (cos(E)+1i*sin(E))).'
Specifically, on my machine (cos(x)+1i*sin(x)).'
takes 19% less time than exp(-1i .* x)'
.
If A
and B
are complex: E
is still real, so you can precompute Bconj = conj(B)
outside the loops (this takes about 10 ms with your data size, and it's done only once) and then replace
(B(cy,cx,z) .* exp(-1i .* E))'
by
(Bconj(cy,cx,z) .* (cos(E)+1i*sin(E))).'
to obtain a similar gain.
Upvotes: 2
Reputation: 879
You can move x .* x_ind(cx)
out of the innermost loop. I don't have a GPU handy to test the timings, but you could split the code into three sections to allow you to use the GPU and parfor
for z = 1 : sz
E = zeros(sy*sx,sx,sy);
A_slice = A(:,:,z);
A_slice = A_slice(:);
parfor cx = 1 : sx
temp = ( x .* x_ind(cx) );
for cy = 1 : sy
E(:, cx, cy) = temp + ( y .* y_ind(cy) ) + ( C(cy,cx) .* D );
end
end
temp = zeros(zeros(sy*sx,sx,sy));
for cx = 1 : sx
for cy = 1 : sy
% Ideally use your GPU magic here
temp(:, cx, cy) = exp(-1i .* E(:, cx, cy)));
end
end
parfor cx = 1 : sx
for cy = 1 : sy
F(cy,cx,z) = (B(cy,cx,z) .* temp(:, cx, cy)' * A_slice;
end
end
end
Upvotes: 1
Reputation: 4732
Not sure if it helps much with speed - but as E is basically a sum maybe you can use that exp (i cx(A+1)x) = exp(i cx(A) x) * exp(i x)
and exp(i x)
can be calculated beforehand.
That way you wouldn't have to evaluate exp each iteration - but just have to multiplicate, which should be faster.
Upvotes: 0
Reputation: 21563
To allow for proper paralellization you need to make sure the loops are completely independant, hence check whether not assigning to E
in each run helps.
Furthermore try to vectorize as much as possible, one simple example could be: y.*y_ind(cy)
If you just create the proper index for all values at once, you can take this out of the lowest loop.
Upvotes: 0
Reputation: 98
There are two main ways of speeding up MATLAB code; preallocation and vectorisation.
You have preallocated well but there is no vectorisation. In order to best learn how to do this you need to have a good grasp of linear algebra and the use of repmat
to expand vectors into multiple dimensions.
Vectorisation can result in multiple orders of magnitude speedup and will use the cores optimally (provided the flag is up).
What is the mathematical expression you are calculating and I may be able to lend a hand?
Upvotes: 1
Reputation: 8038
Some things to think about:
Have you considered using singles?
Can you vectorize the cx, cy portion so that they represent array operations?
Consider changing the floating point rounding or signalling modes.
Upvotes: 3