Reputation: 691
What is the best way to compute the correlation between my features and my target variable? My dataframe has 1,000 rows and 40,000 columns.
Example:
df = pd.DataFrame([[1, 2, 4 ,6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2 ,10]], columns=['Feature1', 'Feature2','Feature3','Target'])
The code below works fine, but it takes too long on my dataframe. I only need the last column of the correlation matrix: the correlation of each feature with the target (not the pairwise feature correlations).
corr_matrix=df.corr()
corr_matrix["Target"].sort_values(ascending=False)
The np.corrcoef() function works on arrays, but can it skip the pairwise feature correlations?
Upvotes: 13
Views: 51510
Reputation: 6302
Since pandas 0.24, released in January 2019, you can simply use DataFrame.corrwith():
df.corrwith(df["Target"])
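Note that the call above also correlates the target with itself. A minimal sketch, assuming you want only the feature columns, sorted as in the question (the name `target_corr` is just illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
    columns=["Feature1", "Feature2", "Feature3", "Target"],
)

# Correlate each feature column with the target only;
# no feature-feature pairs are computed.
target_corr = (
    df.drop(columns="Target")
    .corrwith(df["Target"])
    .sort_values(ascending=False)
)
print(target_corr)
```

This should give the same values as the "Target" column of df.corr(), minus the Target row itself.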
Upvotes: 21
Reputation: 7
df = pd.DataFrame([[1, 2, 4 ,6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2 ,10]], columns=['Feature1', 'Feature2','Feature3','Target'])
For correlation between your target variable and all other features:
df.corr()['Target']
This works in my case. Let me know if it needs any corrections or updates.
As a rule of thumb, to get conclusive results your number of instances (rows) should be at least 10 times your number of features.
Upvotes: -1
Reputation: 158
You can use scipy.stats.pearsonr on each of the feature columns like so:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
# example data
df = pd.DataFrame([[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
                  columns=['Feature1', 'Feature2', 'Feature3', 'Target'])
# Only compute Pearson product-moment correlations between feature
# columns and target column
target_col_name = 'Target'
feature_target_corr = {}
for col in df:
    if target_col_name != col:
        feature_target_corr[col + '_' + target_col_name] = \
            pearsonr(df[col], df[target_col_name])[0]
print("Feature-Target Correlations")
print(feature_target_corr)
Upvotes: 2
Reputation: 11232
You could use pandas corr on each column:
df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
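If the per-column apply is still slow with tens of thousands of columns, the same Pearson coefficients can be computed in one vectorized pass with NumPy. This is only a sketch, not something taken from the answers above: it assumes there are no missing values and no constant columns (a constant column would produce a division by zero), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
    columns=["Feature1", "Feature2", "Feature3", "Target"],
)

X = df.drop(columns="Target").to_numpy(dtype=float)  # shape (n_rows, n_features)
y = df["Target"].to_numpy(dtype=float)               # shape (n_rows,)

# Center every feature column and the target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Pearson r per column: sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)).
# Assumes no NaNs and no zero-variance columns.
numerator = Xc.T @ yc
denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

target_corr = pd.Series(numerator / denominator, index=df.columns.drop("Target"))
print(target_corr.sort_values(ascending=False))
```

The work here is one matrix-vector product over the feature matrix, so the 40,000 x 40,000 feature-feature matrix is never built.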
Upvotes: 21