Reputation: 535
I have an array with shape ~(700, 36000)
and would like to calculate the Pearson correlation coefficient of one specific column against all other columns, thousands of times. I've tried this a number of ways, but none seem particularly efficient:
import numpy
df_corr = numpy.corrcoef(df.T)
corr_column = df_corr[:, column_index]
This of course calculates the entire correlation matrix, and takes ~12s on my machine; this is a problem, as I need to do this ~35,000 times (df is changed slightly every time before creating the correlation matrix)!
I've also tried iterating over the columns individually:
corr_column = numpy.zeros(df.shape[1])
for x in df.columns:
    corr_column[x] = numpy.corrcoef(x=df.iloc[:, column_index], y=df.iloc[:, x])[0][1]
corr_column = corr_column.reshape(-1, 1)
This is slightly faster at ~10s per iteration, but still too slow. Are there ways to find the correlation coefficient between a column and all other columns faster?
Upvotes: 1
Views: 1213
Reputation: 6475
Well, you can just implement the formula yourself: the Pearson coefficient is (E[xy] - E[x]E[y]) / (std_x * std_y), which vectorizes over all columns at once:
import numpy as np
def corr(a, i):
    '''
    Parameters
    ----------
    a : numpy array
    i : column index

    Returns
    -------
    c : numpy array
        correlation coefficients of a[:, i] against all other columns of a
    '''
    mean_t = np.mean(a, axis=0)
    std_t = np.std(a, axis=0)
    mean_i = mean_t[i]
    std_i = std_t[i]
    mean_xy = np.mean(a * a[:, i][:, None], axis=0)
    c = (mean_xy - mean_i * mean_t) / (std_i * std_t)
    return c
a = np.random.randint(0, 10, (700, 36000))
%timeit corr(a,0)
608 ms ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.corrcoef(a.T)
# Actually didn't have the patience to let it finish in my machine
# Using a smaller sample, the implementation above is 100x faster.
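As a sanity check (not part of the original answer), the single-column result can be compared against the corresponding row of the full np.corrcoef matrix on a smaller array; the two formulations are mathematically identical, so they should agree to floating-point precision:

```python
import numpy as np

def corr(a, i):
    # Pearson r of column i against every column: (E[xy] - E[x]E[y]) / (std_x * std_y)
    mean_t = np.mean(a, axis=0)
    std_t = np.std(a, axis=0)
    mean_xy = np.mean(a * a[:, i][:, None], axis=0)
    return (mean_xy - mean_t[i] * mean_t) / (std_t[i] * std_t)

rng = np.random.default_rng(0)
a = rng.integers(0, 10, (700, 200)).astype(float)

# Row 3 of the full correlation matrix should match corr(a, 3)
expected = np.corrcoef(a.T)[3]
np.testing.assert_allclose(corr(a, 3), expected, atol=1e-8)
print("match")
```

Note that np.corrcoef uses the sample covariance (ddof=1) internally while the function above uses population moments (np.std with ddof=0); the ddof factor cancels in the ratio, so the results agree.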
Upvotes: 1