Reputation: 757
I have a matrix (X
) of doubles containing time series. Some of the observations are set to NaN
when there is a missing value. I want to calculate the standard deviation per column to get a std dev value for each column. Since I have NaNs mixed in, a simple std(X)
will not work and if I try std(X(~isnan(X))
I end up getting the std dev for the entire matrix, instead of one per column.
Is there a way to simply omit the NaNs from std dev calculations along the 1st dim without resorting to looping?
Please note that I only want to ignore individual values as opposed to entire rows or cols in case of NaNs. Obviously I cannot set NaNs to zero or any other value as that would impact calculations.
Upvotes: 3
Views: 4172
Reputation: 5303
Have a look at nanstd
(stat toolbox).
The idea is to center the data using nanmean
, then to replace NaN with zero, and finally to compute the standard deviation.
See nanmean
below.
% maximum admissible fraction of missing values
max_miss = 0.6;
[m,n] = size(x);
% replace NaNs with zeros.
inan = find(isnan(x));
x(inan) = zeros(size(inan));
% determine number of available observations on each variable
[i,j] = ind2sub([m,n], inan); % subscripts of missing entries
nans = sparse(i,j,1,m,n); % indicator matrix for missing values
nobs = m - sum(nans);
% set nobs to NaN when there are too few entries to form robust average
minobs = m * (1 - max_miss);
k = find(nobs < minobs);
nobs(k) = NaN;
mx = sum(x) ./ nobs;
See nanstd
below.
flag = 1; % default: normalize by nobs-1
% center data
xc = x - repmat(mx, m, 1);
% replace NaNs with zeros in centered data matrix
xc(inan) = zeros(size(inan));
% standard deviation
sx = sqrt(sum(conj(xc).*xc) ./ (nobs-flag));
Upvotes: 2