Reputation: 2861
I wrote the following function to calculate a row by row correlation of a matrix with respect to a selected row (specified by the index
parameter):
import numpy

# returns a 1D array of correlation coefficients whose length matches
# the row count of the given np_arr_2d
def ma_correlate_vs_index(np_arr_2d, index):
    def corr_upper(x, y):
        # just take the upper right corner of the correlation matrix
        return numpy.ma.corrcoef(x, y)[0, 1]
    return numpy.ma.apply_along_axis(corr_upper, 1, np_arr_2d, np_arr_2d[index, :])
The problem is that the code is very, very slow and I'm not sure how to improve the performance. I believe that both the use of apply_along_axis and the fact that corrcoef builds a full 2x2 correlation matrix for every row, only to discard three of its four entries, contribute to the poor performance. Is there a more direct way to calculate this that may give better performance?
In case it matters, I'm using the ma versions of the functions to mask out some nan values found in the data. Also, the shape of np_arr_2d for my data is (623065, 72).
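For readers who want to reproduce this, here is a minimal sketch with hypothetical 3x3 data standing in for the real (623065, 72) array, showing how the masked input might be built and the function called (the function is repeated from above so the snippet is self-contained):

```python
import numpy

# Hypothetical small stand-in for the real data: mask the nan
# values so the numpy.ma routines ignore them
raw = numpy.array([[1.0, 2.0, 3.0],
                   [2.0, numpy.nan, 6.0],
                   [3.0, 2.0, 1.0]])
np_arr_2d = numpy.ma.masked_invalid(raw)

# Function from the question, repeated for completeness
def ma_correlate_vs_index(np_arr_2d, index):
    def corr_upper(x, y):
        # just take the upper right corner of the correlation matrix
        return numpy.ma.corrcoef(x, y)[0, 1]
    return numpy.ma.apply_along_axis(corr_upper, 1, np_arr_2d, np_arr_2d[index, :])

# Correlate every row against row 0: row 0 with itself gives 1.0,
# and the reversed row [3, 2, 1] gives -1.0
corrs = ma_correlate_vs_index(np_arr_2d, 0)
```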
Upvotes: 2
Views: 1876
Reputation: 4707
I think you are right that there is a lot of overhead in corrcoef. Essentially you just want the dot product of each row with the index row, normalized by the two row norms so that a row's correlation with itself is 1.0.
Something like this will work and will be much faster:
import numpy as np

# Demean each row
demeaned = np_arr_2d - np_arr_2d.mean(axis=1)[:, None]
# Dot product of each row with the index row
res = np.ma.dot(demeaned, demeaned[index])
# Norm of each row
row_norms = np.ma.sqrt((demeaned ** 2).sum(axis=1))
# Normalize
res = res / row_norms / row_norms[index]
This runs much more quickly than your original code. I've used the masked array methods and so I think this will work with your data containing NaN.
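As a sanity check, here is a small sketch on hypothetical 5x8 data with one nan, comparing the vectorized version against the original corrcoef approach. Rows with no masked entries agree exactly; a row containing masked values can differ slightly, because corrcoef re-demeans over only the pairwise-valid columns while the vectorized version uses each row's overall mean:

```python
import numpy as np

# Hypothetical small array with one nan, masked out as in the question
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 8))
data[1, 2] = np.nan
arr = np.ma.masked_invalid(data)
index = 0

# Vectorized correlation of every row against row `index`
demeaned = arr - arr.mean(axis=1)[:, None]
res = np.ma.dot(demeaned, demeaned[index])
row_norms = np.ma.sqrt((demeaned ** 2).sum(axis=1))
res = res / row_norms / row_norms[index]

# Reference: the original corrcoef-based approach, row by row
ref = np.array([np.ma.corrcoef(arr[i], arr[index])[0, 1]
                for i in range(arr.shape[0])])

# Rows without masked entries (all but row 1) should agree exactly
unmasked_rows = [0, 2, 3, 4]
```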
There may be a tiny difference in the norms, controlled by ddof in corrcoef, in which case you can calculate the row_norms using np.ma.std and specify the ddof you want.
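A sketch of that variant: np.ma.std with ddof=d divides the sum of squared deviations by (count - d), so multiplying by sqrt(count - d) recovers the plain norm used above (the count(axis=1) term is my addition, to cope with rows that contain masked values):

```python
import numpy as np

# Hypothetical fully observed 4x10 data, just to illustrate the identity
rng = np.random.default_rng(1)
arr = np.ma.masked_invalid(rng.normal(size=(4, 10)))
index = 0
ddof = 1

demeaned = arr - arr.mean(axis=1)[:, None]
res = np.ma.dot(demeaned, demeaned[index])

# std(ddof=d) * sqrt(count - d) == sqrt of the sum of squared deviations,
# where count is the number of unmasked values in each row
counts = arr.count(axis=1)
row_norms = arr.std(axis=1, ddof=ddof) * np.ma.sqrt(counts - ddof)
res = res / row_norms / row_norms[index]
```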
Upvotes: 4