Reputation: 535
I have an array with shape ~(700, 36000)
and would like to calculate the Pearson correlation coefficient of one specific column against all other columns, thousands of times. I've tried this a number of ways, but none seem particularly efficient:
import numpy
df_corr = numpy.corrcoef(df.T)
corr_column = df_corr[:, column_index]
This of course calculates the entire correlation matrix, and takes ~12s on my machine; this is a problem, as I need to do this ~35,000 times (df is changed slightly every time before creating the correlation matrix)!
I've also tried iterating over the columns individually:
corr_column = numpy.zeros(df.shape[1])
for x in df.columns:
    corr_column[x] = numpy.corrcoef(x=df.iloc[:, column_index], y=df.iloc[:, x])[0][1]
corr_column = corr_column.reshape(-1, 1)
This is slightly faster at ~10s per iteration, but still too slow. Are there ways to find the correlation coefficient between a column and all other columns faster?
Upvotes: 1
Views: 1213
Reputation: 6475
Well, you can just implement the formula yourself: the Pearson coefficient is (E[xy] - E[x]E[y]) / (std_x * std_y), which vectorizes over all columns at once:
import numpy as np
def corr(a, i):
    '''
    Parameters
    ----------
    a : numpy array
    i : column index

    Returns
    -------
    c : numpy array
        correlation coefficients of a[:, i] against all other columns of a
    '''
    mean_t = np.mean(a, axis=0)
    std_t = np.std(a, axis=0)
    mean_i = mean_t[i]
    std_i = std_t[i]
    mean_xy = np.mean(a * a[:, i][:, None], axis=0)
    c = (mean_xy - mean_i * mean_t) / (std_i * std_t)
    return c
a = np.random.randint(0, 10, (700, 36000))
%timeit corr(a,0)
608 ms ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.corrcoef(a.T)
# Actually didn't have the patience to let it finish in my machine
# Using a smaller sample, the implementation above is 100x faster.
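As a sanity check (not part of the original answer), the single-column result can be compared against the corresponding row of the full np.corrcoef matrix on a smaller array; the two formulations are mathematically identical, so they should agree to floating-point precision:

```python
import numpy as np

def corr(a, i):
    # Pearson r of column i against every column: (E[xy] - E[x]E[y]) / (std_x * std_y)
    mean_t = np.mean(a, axis=0)
    std_t = np.std(a, axis=0)
    mean_xy = np.mean(a * a[:, i][:, None], axis=0)
    return (mean_xy - mean_t[i] * mean_t) / (std_t[i] * std_t)

rng = np.random.default_rng(0)
a = rng.integers(0, 10, (700, 200)).astype(float)

# Row 3 of the full correlation matrix should match corr(a, 3)
expected = np.corrcoef(a.T)[3]
np.testing.assert_allclose(corr(a, 3), expected, atol=1e-8)
print("match")
```

Note that np.corrcoef uses the sample covariance (ddof=1) internally while the function above uses population moments (np.std with ddof=0); the ddof factor cancels in the ratio, so the results agree.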
Upvotes: 1