Get the correlation between one variable and all the rest of the dataset

Question

If you want the correlation between all the variables in a DataFrame, you can just do:

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print(so)

However, in this case I have 90 variables, and I already know my dependent variable ("sales"). How can I get the ordered correlation values between "sales" and the rest of the variables?

Grismar · Accepted Answer

Are you asking because you just want to select the correlations involving sales, or because you want to prevent pandas from even computing the correlations you don't need (for performance reasons, for example)? For the former, you can just select the sales column from the .corr() result. For the latter, you probably want corrwith(), which only computes each column's correlation against the one Series you pass in:

import pandas as pd

df = pd.DataFrame([
    {'name': 'john', 'age': 21, 'beard_len': 3, 'wrinkle_index': .02},
    {'name': 'pete', 'age': 83, 'beard_len': 21, 'wrinkle_index': .87},
    {'name': 'will', 'age': 54, 'beard_len': 13, 'wrinkle_index': .52},
    {'name': 'mike', 'age': 32, 'beard_len': 11, 'wrinkle_index': .24},
    {'name': 'greg', 'age': 61, 'beard_len': 14, 'wrinkle_index': .49},
])

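# Since pandas 2.0, non-numeric columns such as 'name' are no longer dropped
# silently, so pass numeric_only=True (available since pandas 1.5).
print(df.corr(numeric_only=True)['age'])          # full matrix, then pick one column
print(df.corrwith(df['age'], numeric_only=True))  # only correlations against 'age'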

Output:

age              1.000000
beard_len        0.951520
wrinkle_index    0.986042
Name: age, dtype: float64
age              1.000000
beard_len        0.951520
wrinkle_index    0.986042
dtype: float64
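
To get the ordered values the question asks for, you could drop the target from the features, take absolute values, and sort. This is only a minimal sketch: the df_train built here is a made-up stand-in (the columns price, ads, noise_col and region are hypothetical), so swap in your real frame and keep only the last few lines.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train: a "sales" target plus a few features.
rng = np.random.default_rng(0)
n = 200
price = rng.normal(10, 2, n)
ads = rng.normal(5, 1, n)
df_train = pd.DataFrame({
    "sales": 3 * ads - 2 * price + rng.normal(0, 1, n),
    "price": price,
    "ads": ads,
    "noise_col": rng.normal(0, 1, n),
    "region": rng.choice(["north", "south"], n),  # non-numeric, filtered out below
})

# Keep numeric columns only, so this works regardless of the pandas version.
numeric = df_train.select_dtypes("number")

# Correlate every feature with the target, then order by absolute strength.
corr_with_sales = (
    numeric.drop(columns="sales")      # avoid the trivial sales-vs-sales entry
    .corrwith(numeric["sales"])        # one correlation per remaining column
    .abs()
    .sort_values(ascending=False)      # strongest relationship first
)
print(corr_with_sales)
```

Note that .abs() throws away the sign. If you need to know whether a variable moves with or against sales, sort with sort_values(key=abs, ascending=False) instead of calling .abs() first.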
