Reputation: 9868
I want to run a Spearman correlation of each column vs all the other columns in pandas. I need only the distribution of correlations (array), not a correlation matrix.
I know that I could use df.corr(method='spearman'); however, I need only the pairwise correlations, not the entire correlation matrix or its diagonal. I think this may speed up the computation, since I would be computing only (N^2 - N)/2 correlations instead of N^2.
However, this is just an assumption - since the matrix would be symmetric, maybe pandas already works by computing one half of the correlation matrix and then filling the rest accordingly.
So far, my (very inefficient) solution is:
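import pandas as pd
import scipy.stats as ss
# d is a pandas DataFrame
corr_a = []
for i, col1 in enumerate(d.columns):
    # pair each column only with the columns that come after it
    for col2 in d.columns[i+1:]:
        # d[col] selects a column; d.loc[col] would look up a row label
        r, _ = ss.spearmanr(d[col1], d[col2])
        corr_a.append(r)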
Is there a built-in or vectorized API that would run this faster?
Upvotes: 1
Views: 987
Reputation: 9868
The pandas solution was actually easier than I thought:
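import numpy as np
import pandas as pd
# d is a pandas DataFrame
d = d.corr(method='spearman')
# keep only the strict upper triangle (k=1 also excludes the diagonal);
# np.bool was removed from recent NumPy, so use the builtin bool
d = d.where(np.triu(np.ones(d.shape, dtype=bool), k=1))
# go to long form and drop the masked (NaN) entries
d = d.stack().dropna().reset_index()
corr = d.iloc[:, 2]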
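If it helps, the same flat array can also be read straight out of the correlation matrix with np.triu_indices, skipping the mask and stack steps. This is only a sketch: the function name upper_triangle_spearman and the random example frame are made up for illustration, and the full matrix is still computed first.

```python
import numpy as np
import pandas as pd


def upper_triangle_spearman(df: pd.DataFrame) -> np.ndarray:
    """Return the N*(N-1)/2 pairwise Spearman correlations as a flat array."""
    corr = df.corr(method="spearman").to_numpy()
    # k=1 selects the entries strictly above the diagonal
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]


# Hypothetical example data: 50 observations of 4 variables
rng = np.random.default_rng(0)
example = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
pairs = upper_triangle_spearman(example)
```

The values come out in row-major order, (a, b), (a, c), (a, d), (b, c), and so on, which is the same order the nested loop in the question produces.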
Feel free to edit if you can provide a way to compute only half of the correlation matrix; my original matrix is high-dimensional, so the computational cost of this solution may explode.
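One possible way to compute only the N*(N-1)/2 pairs, offered as a sketch rather than a benchmarked solution: Spearman correlation is Pearson correlation on the ranks, and scipy.spatial.distance.pdist with metric="correlation" returns 1 minus the Pearson correlation for each pair of rows in condensed (upper-triangle) form. The function name spearman_condensed and the example data are made up here, and the sketch assumes the DataFrame has no missing values (pandas' corr excludes NaNs pairwise, this does not). Whether it is actually faster on your data would need to be timed.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist


def spearman_condensed(df: pd.DataFrame) -> np.ndarray:
    """Pairwise Spearman correlations between columns, upper triangle only.

    Assumes df contains no NaN values.
    """
    # Rank each column (ties get average ranks), then put one variable per row
    ranks = df.rank().to_numpy().T
    # pdist returns the correlation distance, i.e. 1 - Pearson r, for each row pair
    return 1.0 - pdist(ranks, metric="correlation")


# Hypothetical example data: 40 observations of 5 variables
rng = np.random.default_rng(1)
sample = pd.DataFrame(rng.normal(size=(40, 5)), columns=list("vwxyz"))
condensed = spearman_condensed(sample)
```

The output has one entry per unordered column pair, in the same order as np.triu_indices(N, k=1), so it can be used directly as the distribution of correlations.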
Upvotes: 1