Reputation: 137
I have the following code to calculate the semi-variance (dis-correlation) between scattered data points separated by a given distance in space (spatial lag) and in time (needs to be < 1, i.e., < 1 day). I need to calculate the semi-variance between a single data point and all of the other data points for each spatial lag.
However, it is very slow and I need to do this outside of a loop if possible. The second loop is relatively quick, it's the first loop, having to go through each element one-by-one, that is slow. Any ideas on how I can improve this code making it quicker? I use a datapoints array that is 90 times larger than the one used in this example.
I'm using MATLAB R2016a.
Working code:
N = 1E2; % example problem size
D = 1:N; % distance
t = rand(1,N)+0.2; % time
datapoints = rand(1,N)+20; % temperature
% obtain differences between each element
distance_diff = abs(bsxfun(@minus, D.', D));
time_diff = abs(bsxfun(@minus, t.', t));
D_lag = [5 10 15 20 25]; % spatial lag, need to perform calculation for each lag
% the first loop is the one that slows down my function
for nn = 1:length(datapoints) % for each element
for lag_n = 1:length(D_lag) % for each distance lag
sl = D_lag(lag_n);
d = distance_diff(nn,:) < sl & time_diff(nn,:) < 1; % select points within spatial lag and 1 day
strfile(nn,lag_n).variable = datapoints(d); % save selected data points
% calculate semi-variogram for each nn element and each lag_n
a = datapoints(nn)-strfile(nn,lag_n).variable; % for each element in data points
strfile(nn,lag_n).semi_var = 0.5*((nanmean(a)^0.5)^4)/ ...
(0.457+(0.494/length(strfile(nn,lag_n).variable)));
end
end
Upvotes: 0
Views: 65
Reputation: 24169
It is possible to improve your code while keeping it in its current loop form, and it seems possible to get rid of the loops altogether. Since it would require a deeper understanding of your problem and the structure of your data, I will not show a vectorized version at this time (with the hope of adding it sometime in the future), but will try to explain how to get there.
I'll be using implicit expansion syntax (so no bsxfun
) in this answer (works in R2016b onwards).
Here are the things I noticed immediately that can be improved in your solution:
MATLAB warns about growing strfile
inside the inner loop. The easiest way to avoid it is to invert the loop order, then this:
for nn = 1:length(datapoints) % for each element
for lag_n = 1:length(D_lag) % for each distance lag
turns into this:
for nn = numel(datapoints):-1:1 % for each element
for lag_n = numel(D_lag):-1:1 % for each distance lag
(I'm using the numel
function as it is more expressive than length
)
Since you only ever check if the time difference is smaller than a single value (1
), you absolutely don't need to do it inside the inner loop. In fact, you can even move it outside the loops entirely, so instead of:
for nn = ...
for lag_n = ...
sl = D_lag(lag_n);
d = distance_diff(nn,:) < sl & time_diff(nn,:) < 1;
You can do this:
td = time_diff < 1; % or even directly: td = abs(t.'-t) < 1
for nn = ...
ttd = td(nn,:); % doesn't change each lag_n, no need to recompute
for lag_n = ...
d = distance_diff(nn,:) < D_lag(lag_n) & ttd;
The above is the first step towards vectorization. In vectorization, you want to work in a "SIMD" approach, that is, issue a command once and let it work on as large a chunk of data as possible, ideally your entire dataset at once.
The above two changes are a result of a careful consideration of memory allocation and data dependencies, and yield about a 50% reduction in runtime.
Now, as promised, few pointers regarding vectorization:
If we convert D_lag
into a vector along the 3rd dimension, we could compute all required values of d
in one go:
d = distance_diff < permute(D_lag,[1,3,2]) & time_diff < 1; % ...or simply
d = abs(D.'-D) < permute(D_lag,[1,3,2]) & abs(t.'-t) < 1;
length
, you would use size(strfile(...),#)
.nanmean(x)
you would use the multi-argument syntax as discussed in this Q&A.I think it would make sense to create the struct
output after all computations are complete, using:
strfile = struct('variable', (...), 'semi_var',(...) );
Here's my code in case anybody wants to benchmark or build upon it:
function q49779404
N = 1E4; % example problem size
D = 1:N; % distance
t = rand(1,N)+0.2; % time
datapoints = rand(1,N)+20; % temperature
% obtain differences between each element
distance_diff = abs(D.'-D); % implicit expansion
time_diff = abs(t.'-t); % implicit expansion
D_lag = [5 10 15 20 25]; % spatial lag, need to perform calculation for each lag
out{1} = method1(datapoints, D_lag, distance_diff, time_diff);
out{2} = method2(datapoints, D_lag, distance_diff, time_diff);
% out{3} = method3(datapoints, D_lag, distance_diff, time_diff);
function strfile = method1(datapoints, D_lag, distance_diff, time_diff)
for nn = 1:length(datapoints) % for each element
for lag_n = 1:length(D_lag) % for each distance lag
sl = D_lag(lag_n);
d = distance_diff(nn,:) < sl & time_diff(nn,:) < 1; % select points within spatial lag and 1 day
strfile(nn,lag_n).variable = datapoints(d); % save selected data points
% calculate semi-variogram for each nn element and each lag_n
a = datapoints(nn)-strfile(nn,lag_n).variable; % for each element in data points
strfile(nn,lag_n).semi_var = 0.5*((nanmean(a)^0.5)^4) / ...
(0.457+(0.494/length(strfile(nn,lag_n).variable)));
end
end
function strfile = method2(datapoints, D_lag, distance_diff, time_diff)
td = time_diff < 1;
for nn = numel(datapoints):-1:1 % for each element
ttd = td(nn,:); % doesn't change each lag_n, no need to recompute
for lag_n = numel(D_lag):-1:1 % for each distance lag
d = distance_diff(nn,:) < D_lag(lag_n) & ttd; % select points within spatial lag and 1 day
strfile(nn,lag_n).variable = datapoints(d); % save selected data points
% calculate semi-variogram for each nn element and each lag_n
a = datapoints(nn)-strfile(nn,lag_n).variable; % for each element in data points
strfile(nn,lag_n).semi_var = 0.5*((nanmean(a).^0.5).^4) ./ ...
(0.457+(0.494./length(strfile(nn,lag_n).variable)));
end
end
function strfile = method3(datapoints, D_lag, distance_diff, time_diff)
d = distance_diff < permute(D_lag,[1,3,2]) & time_diff < 1;
strfile = struct('variable',[],'semi_var',[]);
Upvotes: 3