Reputation: 99
If you want to get the correlation between all the variables in a dataframe you can just do:
c = df_train.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so)
However, in this case I have 90 variables, and I already know my dependent variable ("sales"). How can I get the ordered correlation values between "sales" and the rest of the variables?
Upvotes: 2
Views: 2893
Reputation: 31329
Are you asking because you just want to select the correlations involving sales, or because you want to prevent pandas from even computing the correlations you don't need (for performance reasons, for example)? For the former, you can just select the sales series from the .corr() result; for the latter, you probably want corrwith():
import pandas as pd
df = pd.DataFrame([
{'name': 'john', 'age': 21, 'beard_len': 3, 'wrinkle_index': .02},
{'name': 'pete', 'age': 83, 'beard_len': 21, 'wrinkle_index': .87},
{'name': 'will', 'age': 54, 'beard_len': 13, 'wrinkle_index': .52},
{'name': 'mike', 'age': 32, 'beard_len': 11, 'wrinkle_index': .24},
{'name': 'greg', 'age': 61, 'beard_len': 14, 'wrinkle_index': .49},
])
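# numeric_only=True skips the non-numeric 'name' column
# (pandas >= 2.0 no longer drops it silently)
print(df.corr(numeric_only=True)['age'])
print(df.corrwith(df['age'], numeric_only=True))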
Output:
age 1.000000
beard_len 0.951520
wrinkle_index 0.986042
Name: age, dtype: float64
age 1.000000
beard_len 0.951520
wrinkle_index 0.986042
dtype: float64
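To get the ordered values the question asks for, the corrwith() result can be sorted by absolute value after dropping the target column. This is only a minimal sketch: the small df_train below and its "price", "visits", and "store" columns are made up for illustration, and select_dtypes is used here as one way to leave out non-numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for df_train; every column except "sales" is invented.
df_train = pd.DataFrame({
    "sales": [10, 20, 30, 40, 50],
    "price": [5, 4, 3, 2, 1],
    "visits": [1, 3, 2, 5, 4],
    "store": ["a", "b", "c", "d", "e"],  # non-numeric, excluded below
})

target = "sales"

# Keep only numeric predictors and leave out the target itself,
# so the trivial sales-vs-sales correlation of 1.0 is not reported.
features = df_train.drop(columns=[target]).select_dtypes(include="number")

# Correlate each remaining column with the target only.
corr_with_sales = features.corrwith(df_train[target])

# Order by strength of the relationship, strongest first.
ordered = corr_with_sales.abs().sort_values(ascending=False)
print(ordered)
```

Taking abs() before sorting mirrors the question's own code; drop it if the sign of the correlation matters.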
Upvotes: 4
Reputation: 399
Once you have every correlation, you can filter for those involving the variable "sales":
sales_correlations = so.sales
Alternatively, get the correlation of the whole dataframe with the "sales" series directly:
c = df_train.corrwith(df_train.sales)
Upvotes: 2