Reputation: 1168
I want to obtain a matrix of partial correlations (for all pairs of columns), removing the effect of all other columns.
I am using pingouin; however, the function df.pcorr().round(3) only works with Pearson correlation.
Here is the code:
#!pip install pingouin
import pandas as pd
import pingouin as pg
df = pg.read_dataset('partial_corr')
print (df.pcorr().round(3)) #LIKE THIS BUT USING SPEARMAN CORRELATION
OUT: #like this one except obtained with SPEARMAN
         x      y    cv1    cv2    cv3
x    1.000  0.493 -0.095  0.130 -0.385
y    0.493  1.000 -0.007  0.104 -0.002
cv1 -0.095 -0.007  1.000 -0.241 -0.470
cv2  0.130  0.104 -0.241  1.000 -0.118
cv3 -0.385 -0.002 -0.470 -0.118  1.000
Question: how do I compute a partial correlation matrix for a pandas DataFrame, controlling for all other columns, using Spearman correlation?
Upvotes: 0
Views: 1338
Reputation: 4939
You can use the fact that a partial correlation matrix is simply a correlation matrix of residuals: for each pair of variables, both are regressed on the rest of the variables, and the residuals are then correlated (see here).
You will need to get all the pairs (itertools.combinations will help here), fit a linear regression of each variable in the pair on the remaining columns (sklearn), compute the Spearman correlation of the residuals, then reshape the result into a matrix.
Here is an example with the Iris dataset that comes with sklearn.
import pandas as pd
from sklearn.datasets import load_iris
from itertools import combinations
from sklearn import linear_model
#data
iris_data = load_iris()
iris_data = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
#get all the pairs of variables
xy_combinations = list(combinations(iris_data.columns, 2))
z = [[col for col in iris_data.columns if col not in xy] for xy in xy_combinations]
xyz_combinations = list(zip(xy_combinations, z))
#Compute spearman correlation
def part_corr(xyz):
    var1, var2, rest = *xyz[0], xyz[1]
    var1_reg = linear_model.LinearRegression().fit(iris_data[rest], iris_data[var1])
    var2_reg = linear_model.LinearRegression().fit(iris_data[rest], iris_data[var2])
    var1_res = iris_data[var1] - var1_reg.predict(iris_data[rest])
    var2_res = iris_data[var2] - var2_reg.predict(iris_data[rest])
    part_corr_df = pd.concat([var1_res, var2_res], axis=1).corr(method='spearman')
    return part_corr_df.unstack()
# Reshaping data for square matrix form
part_corr_df = pd.DataFrame(pd.concat(list(map(part_corr, xyz_combinations))), columns=['part_corr']).reset_index()
part_corr_matrix = part_corr_df.pivot_table(values='part_corr', index='level_0', columns='level_1')
part_corr_matrix
Output
level_1            petal length (cm)  petal width (cm)  sepal length (cm)  sepal width (cm)
level_0
petal length (cm)           1.000000          0.862649           0.681566         -0.633985
petal width (cm)            0.862649          1.000000          -0.303597          0.362407
sepal length (cm)           0.681566         -0.303597           1.000000          0.615629
sepal width (cm)           -0.633985          0.362407           0.615629          1.000000
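The same residual-based idea can be wrapped in a function that takes any numeric DataFrame, so it is not tied to iris_data. This is only a sketch under a few assumptions: the helper name residual_spearman_matrix is made up for illustration, NumPy least squares (with an intercept column) stands in for sklearn's LinearRegression, the Spearman correlation is computed as the Pearson correlation of the ranked residuals, and the demo data is synthetic.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def residual_spearman_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation of OLS residuals for every pair of columns.

    Hypothetical helper: each column of a pair is regressed on all the
    remaining columns, and the two residual series are rank-correlated.
    """
    cols = list(df.columns)
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        rest = [c for c in cols if c not in (a, b)]
        # Design matrix: intercept plus all remaining columns
        X = np.column_stack([np.ones(len(df)), df[rest].to_numpy(dtype=float)])
        Y = df[[a, b]].to_numpy(dtype=float)
        # One least-squares fit handles both targets at once
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = pd.DataFrame(Y - X @ beta, columns=[a, b])
        # Spearman correlation = Pearson correlation of the ranks
        r = resid.rank().corr().iloc[0, 1]
        out.loc[a, b] = out.loc[b, a] = r
    return out


# Synthetic demo data (made up): x and y are related only through z
rng = np.random.default_rng(0)
z = rng.normal(size=200)
demo = pd.DataFrame({
    "x": z + rng.normal(size=200),
    "y": z + rng.normal(size=200),
    "z": z,
})
matrix = residual_spearman_matrix(demo)
print(matrix.round(3))
```

With the question's data this would be called as residual_spearman_matrix(df), assuming all columns are numeric and there are no missing values.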
Upvotes: 1
Reputation: 3684
It would be helpful if you could add the first n rows of your table to recreate your dataframe.
However, you can calculate the partial correlation using pingouin.partial_corr() by passing the method='spearman' parameter.
Take a look at the examples here https://pingouin-stats.org/generated/pingouin.partial_corr.html
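If a full matrix is needed without calling the function once per pair, one option is to rank-transform the columns and read the partial correlations off the inverse covariance matrix of the ranks. The sketch below assumes that "Spearman partial correlation" means the partial correlation of the rank-transformed data; it is not pingouin's own code, spearman_pcorr is a hypothetical helper name, and the demo data is synthetic. Note that this is a different quantity from the Spearman correlation of residuals of the raw (unranked) data, so the two approaches will generally give different numbers.

```python
import numpy as np
import pandas as pd


def spearman_pcorr(df: pd.DataFrame) -> pd.DataFrame:
    """Partial correlation matrix of the rank-transformed columns.

    Hypothetical helper: uses rho_ij = -P_ij / sqrt(P_ii * P_jj),
    where P is the inverse covariance matrix of the ranks.
    """
    ranks = df.rank()                       # ties get average ranks
    cov = ranks.cov().to_numpy()
    prec = np.linalg.pinv(cov)              # (pseudo-)inverse covariance
    d = 1.0 / np.sqrt(np.diag(prec))
    pcor = -prec * np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pd.DataFrame(pcor, index=df.columns, columns=df.columns)


# Synthetic demo data (made up): x and y are related only through z
rng = np.random.default_rng(0)
z = rng.normal(size=200)
demo = pd.DataFrame({
    "x": z + rng.normal(size=200),
    "y": z + rng.normal(size=200),
    "z": z,
})
result = spearman_pcorr(demo)
print(result.round(3))
```

For the question's DataFrame the call would be spearman_pcorr(df).round(3), assuming numeric columns and no missing values.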
Upvotes: 0