PassaroBagante

Reputation: 99

Get the correlation between one variable and all the rest of the dataset

If you want to get the correlation between all the variables in a dataframe you can just do:

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

print(so)

However, in this case I have 90 variables, and I already know my dependent variable ("sales"). How can I get the correlation values between "sales" and the rest of the variables, in sorted order?

Upvotes: 2

Views: 2893

Answers (2)

Grismar

Reputation: 31329

Are you asking because you just want to select the correlations involving sales, or because you want to prevent pandas from even computing the correlations you don't need (for performance reasons, for example)? For the former, you can just select the sales series from the .corr() result; for the latter, you probably want corrwith():

import pandas as pd

df = pd.DataFrame([
    {'name': 'john', 'age': 21, 'beard_len': 3, 'wrinkle_index': .02},
    {'name': 'pete', 'age': 83, 'beard_len': 21, 'wrinkle_index': .87},
    {'name': 'will', 'age': 54, 'beard_len': 13, 'wrinkle_index': .52},
    {'name': 'mike', 'age': 32, 'beard_len': 11, 'wrinkle_index': .24},
    {'name': 'greg', 'age': 61, 'beard_len': 14, 'wrinkle_index': .49},
])

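# numeric_only=True skips the non-numeric 'name' column; pandas 2.0 and later no longer drop it silently
print(df.corr(numeric_only=True)['age'])
print(df.corrwith(df['age'], numeric_only=True))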

Output:

age              1.000000
beard_len        0.951520
wrinkle_index    0.986042
Name: age, dtype: float64
age              1.000000
beard_len        0.951520
wrinkle_index    0.986042
dtype: float64
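To get the values in order, as the question asks, one possible approach is to drop the self-correlation and sort what remains. This is only a sketch on the same toy data (the 'name' column is left out so that only numeric columns are involved); the variable names `age_corr` and `ordered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 83, 54, 32, 61],
    "beard_len": [3, 21, 13, 11, 14],
    "wrinkle_index": [0.02, 0.87, 0.52, 0.24, 0.49],
})

# Correlate every column with "age", then drop the trivial self-correlation.
age_corr = df.corrwith(df["age"]).drop("age")

# Order by absolute strength while keeping the original sign of each value.
ordered = age_corr.sort_values(key=abs, ascending=False)
print(ordered)
```

Sorting with key=abs keeps negative correlations visible as negative, which calling .abs() before sorting would hide.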

Upvotes: 4

itogaston

Reputation: 399

Option 1

Once you have every correlation, you can select those involving the variable "sales" from the sorted series so in the question:

sales_correlations = so["sales"]

Option 2

Get the correlation of the whole dataframe with the "sales" series:

c = df_train.corrwith(df_train.sales)
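With many columns you may only care about the strongest few. The following is a rough sketch of Option 2 taken one step further, assuming a synthetic stand-in for df_train: the columns "price", "ads" and "noise" and the way "sales" is generated are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train; every column except "sales" is hypothetical.
rng = np.random.default_rng(0)
n = 200
price = rng.normal(10, 2, n)
ads = rng.normal(5, 1, n)
noise = rng.normal(0, 1, n)
df_train = pd.DataFrame({
    "price": price,
    "ads": ads,
    "noise": noise,
    "sales": 100 - 3 * price + 2 * ads + rng.normal(0, 1, n),
})

target = "sales"

# Leave the target out of the features so it is not correlated with itself.
features = df_train.drop(columns=target)

# Absolute correlation of each feature with the target.
strength = features.corrwith(df_train[target]).abs()

# Keep only the two features most strongly related to the target.
top = strength.nlargest(2)
print(top)
```

nlargest returns the values already sorted in descending order, so no separate sort_values call is needed.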

Upvotes: 2
