Reputation: 2861
I wrote the following function to calculate a row by row correlation of a matrix with respect to a selected row (specified by the index
parameter):
import numpy

# returns a 1D array of correlation coefficients whose length matches
# the row count of the given np_arr_2d
def ma_correlate_vs_index(np_arr_2d, index):
    def corr_upper(x, y):
        # just take the upper right corner of the correlation matrix
        return numpy.ma.corrcoef(x, y)[0, 1]
    return numpy.ma.apply_along_axis(corr_upper, 1, np_arr_2d, np_arr_2d[index, :])
The problem is that the code is very, very slow and I'm not sure how to improve the performance. I believe that both the use of apply_along_axis and the fact that corrcoef builds a full 2x2 correlation matrix for every row, only to discard three of its four entries, contribute to the poor performance. Is there a more direct way to calculate this that may give better performance?
In case it matters, I'm using the ma versions of the functions to mask out some nan values found in the data. Also, the shape of np_arr_2d for my data is (623065, 72).
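For readers who want to reproduce this, here is a minimal sketch with hypothetical 3x3 data standing in for the real (623065, 72) array, showing how the masked input might be built and the function called (the function is repeated from above so the snippet is self-contained):

```python
import numpy

# Hypothetical small stand-in for the real data: mask the nan
# values so the numpy.ma routines ignore them
raw = numpy.array([[1.0, 2.0, 3.0],
                   [2.0, numpy.nan, 6.0],
                   [3.0, 2.0, 1.0]])
np_arr_2d = numpy.ma.masked_invalid(raw)

# Function from the question, repeated for completeness
def ma_correlate_vs_index(np_arr_2d, index):
    def corr_upper(x, y):
        # just take the upper right corner of the correlation matrix
        return numpy.ma.corrcoef(x, y)[0, 1]
    return numpy.ma.apply_along_axis(corr_upper, 1, np_arr_2d, np_arr_2d[index, :])

# Correlate every row against row 0: row 0 with itself gives 1.0,
# and the reversed row [3, 2, 1] gives -1.0
corrs = ma_correlate_vs_index(np_arr_2d, 0)
```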
Upvotes: 2
Views: 1876
Reputation: 4707
I think you are right that there is a lot of overhead in corrcoef. Essentially you just want the dot product of each row with the index row, normalized by the two row norms so that a row's correlation with itself is 1.0.
Something like this will work and will be much faster:
import numpy as np

# Demean each row
demeaned = np_arr_2d - np_arr_2d.mean(axis=1)[:, None]
# Dot product of each row with the index row
res = np.ma.dot(demeaned, demeaned[index])
# Norm of each row
row_norms = np.ma.sqrt((demeaned ** 2).sum(axis=1))
# Normalize
res = res / row_norms / row_norms[index]
This runs much more quickly than your original code. I've used the masked array methods and so I think this will work with your data containing NaN.
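As a sanity check, here is a small sketch on hypothetical 5x8 data with one nan, comparing the vectorized version against the original corrcoef approach. Rows with no masked entries agree exactly; a row containing masked values can differ slightly, because corrcoef re-demeans over only the pairwise-valid columns while the vectorized version uses each row's overall mean:

```python
import numpy as np

# Hypothetical small array with one nan, masked out as in the question
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 8))
data[1, 2] = np.nan
arr = np.ma.masked_invalid(data)
index = 0

# Vectorized correlation of every row against row `index`
demeaned = arr - arr.mean(axis=1)[:, None]
res = np.ma.dot(demeaned, demeaned[index])
row_norms = np.ma.sqrt((demeaned ** 2).sum(axis=1))
res = res / row_norms / row_norms[index]

# Reference: the original corrcoef-based approach, row by row
ref = np.array([np.ma.corrcoef(arr[i], arr[index])[0, 1]
                for i in range(arr.shape[0])])

# Rows without masked entries (all but row 1) should agree exactly
unmasked_rows = [0, 2, 3, 4]
```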
There may be a tiny difference in the norms, controlled by ddof in corrcoef, in which case you can calculate the row_norms using np.ma.std and specify the ddof you want.
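A sketch of that variant: np.ma.std with ddof=d divides the sum of squared deviations by (count - d), so multiplying by sqrt(count - d) recovers the plain norm used above (the count(axis=1) term is my addition, to cope with rows that contain masked values):

```python
import numpy as np

# Hypothetical fully observed 4x10 data, just to illustrate the identity
rng = np.random.default_rng(1)
arr = np.ma.masked_invalid(rng.normal(size=(4, 10)))
index = 0
ddof = 1

demeaned = arr - arr.mean(axis=1)[:, None]
res = np.ma.dot(demeaned, demeaned[index])

# std(ddof=d) * sqrt(count - d) == sqrt of the sum of squared deviations,
# where count is the number of unmasked values in each row
counts = arr.count(axis=1)
row_norms = arr.std(axis=1, ddof=ddof) * np.ma.sqrt(counts - ddof)
res = res / row_norms / row_norms[index]
```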
Upvotes: 4