Time complexity of element accessing in MATLAB

Question

The following two code snippets perform the same task (generating M samples uniformly from an N-dim sphere). I was wondering why the latter one consumes much more time than the previous one.

%% MATLAB R2014a    
M = 30;
N = 10000;    

#1
tic
S = zeros(M, N);
for k = 1:M
    P = ones(1, N);
    for i = 1:N - 1
        t = rand*2*pi;
        P(1:i) = P(1:i)*sin(t);
        P(i+1) = P(i+1)*cos(t);
    end
    S(k,:) = P;
end
toc

#2
tic
S = ones(M, N);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(k, 1:i) = S(k, 1:i)*sin(t);
        S(k, i+1) = S(k, i+1)*cos(t);
    end
end
toc

The output is:

Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.

And I also tried M = 1,

Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.

#2 is nearly 4 times slower than #1. Is frequent 2d element accessing in #2 making it time-consuming?

Peter · Accepted Answer

The time difference is due to memory access patterns, and how well they map onto the cache. And also possibly to MATLAB's exploitation of your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is next to S(1,1).

In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache for the fast repeated access you need to perform. They're also neighbors, and trivially vectorized (I'm not certain if MATLAB performs this optimization, but I'd hope so...)

In #2, you access a row of S at a time, which is not contiguous, but rather is interleaved by M values. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It'll have to be read back in for each repeated access, even though you're ignoring 29/30 values in that data.

Here's the test. All I'm doing it transposing S so that you can process a column at a time instead, then putting it back at the end just to get the same result:

#3
tic
S = ones(N, M);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(1:i, k) = S(1:i, k)*sin(t);
        S(i+1, k) = S(i+1, k)*cos(t);
    end
end
S = S.';
toc

Results:

Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.

Yep, transposing S gets us the same contiguous access and performance as the separate vector approach. By the way, L3 vs. L2 is about 4x more clock cycles... 1

Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:

Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.

Much lower difference, though now we're probably into L1 effects.

Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RV's have the correct symmetry.

#4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc

Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.

Time complexity of element accessing in MATLAB

Answers (2)

Related Questions