Cox Tox

Reputation: 691

Compute correlation between features and target variable

What is the best way to compute the correlation between my features and the target variable? My dataframe has 1,000 rows and 40,000 columns.

Example:

df = pd.DataFrame([[1, 2, 4 ,6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2 ,10]], columns=['Feature1', 'Feature2','Feature3','Target'])

This code works, but it takes too long on my dataframe. I only need the last column of the correlation matrix: the correlation with the target (not the pairwise feature correlations).

corr_matrix=df.corr()
corr_matrix["Target"].sort_values(ascending=False)

The np.corrcoef() function works with arrays, but can we exclude the pairwise feature correlations?

Upvotes: 13

Views: 51510

Answers (4)

Martin Valgur

Reputation: 6302

Since Pandas 0.24 released in January 2019, you can simply use DataFrame.corrwith():

df.corrwith(df["Target"])
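As a minimal sketch on the question's example data, the target column can be dropped before calling corrwith so the result contains only feature-to-target correlations, then sorted as the question does. The names target_corr and target_corr_sorted are illustrative and not part of the original answer.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    [[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
    columns=["Feature1", "Feature2", "Feature3", "Target"],
)

# Correlate each feature column with the target only;
# the full feature-by-feature matrix is never built.
target_corr = df.drop(columns="Target").corrwith(df["Target"])

# Sort from strongest positive to strongest negative correlation
target_corr_sorted = target_corr.sort_values(ascending=False)
print(target_corr_sorted)
```

Dropping the target first avoids the trivial self-correlation entry of 1.0 in the output.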

Upvotes: 21

rohit jalheira

Reputation: 7

df = pd.DataFrame([[1, 2, 4 ,6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2 ,10]], columns=['Feature1', 'Feature2','Feature3','Target'])

For correlation between your target variable and all other features:

df.corr()['Target']

This works in my case. Let me know if any corrections or updates are needed.

As a rule of thumb, to get conclusive results the number of instances (rows) should be at least 10 times the number of features.

Upvotes: -1

William Gurecky

Reputation: 158

You can use scipy.stats.pearsonr on each of the feature columns like so:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

# example data
df = pd.DataFrame([[1, 2, 4 ,6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2 ,10]],
                  columns=['Feature1', 'Feature2','Feature3','Target'])

# Only compute pearson prod-moment correlations between feature
# columns and target column
target_col_name = 'Target'
feature_target_corr = {}
for col in df:
    if target_col_name != col:
        feature_target_corr[col + '_' + target_col_name] = \
            pearsonr(df[col], df[target_col_name])[0]
print("Feature-Target Correlations")
print(feature_target_corr)
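pearsonr also returns a p-value alongside the coefficient, which the loop above discards. A small sketch, assuming you want both, collects them into one table; corr_table and the column names r and p_value are hypothetical names chosen for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

# Example data from the question
df = pd.DataFrame(
    [[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
    columns=["Feature1", "Feature2", "Feature3", "Target"],
)

rows = {}
for col in df.columns.drop("Target"):
    # pearsonr returns the coefficient first and the two-sided p-value second
    result = pearsonr(df[col], df["Target"])
    rows[col] = {"r": result[0], "p_value": result[1]}

# One row per feature, with columns "r" and "p_value"
corr_table = pd.DataFrame.from_dict(rows, orient="index")
print(corr_table)
```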

Upvotes: 2

w-m

Reputation: 11232

You could use pandas corr on each column:

df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
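If the per-column apply is still slow on a very wide frame, the same Pearson coefficients can be computed in one vectorized NumPy pass. This is only a sketch, under the assumptions that the data contain no missing values and no constant columns (a constant column would give a zero denominator); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    [[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
    columns=["Feature1", "Feature2", "Feature3", "Target"],
)

X = df.drop(columns="Target").to_numpy(dtype=float)  # shape (n_rows, n_features)
y = df["Target"].to_numpy(dtype=float)               # shape (n_rows,)

# Center the features and the target
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Pearson r per feature: covariance term divided by the product of norms
r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

vectorized_corr = pd.Series(r, index=df.columns.drop("Target"))
print(vectorized_corr)
```

This touches each feature column only once and never forms the 40,000 x 40,000 matrix that df.corr() would build.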

Upvotes: 21
