Reputation:
I have a pandas DataFrame that stores each user's id, their salary range (one of 3 possible ranges), and the profit they generated, as below:
user_id   salary_range        profit_amount
-------   ----------------    -------------
123       0 - 35,000          324
654       50,000 - 100,000    2083
129       50,000 - 100,000    20023
654       0 - 35,000          699
398       35,000 - 49,999     298
I would like to see if there is any correlation between a user's salary range and the profit they generate.
Typically I would use a seaborn.heatmap along with df.corr(), but this only works for two numerical variables, and while salary is typically a numerical amount, here the range is categorical.
Personally, my method of solving this would be to rank the ranges from 1 to 3 and then compute a correlation from there. However, I believe there are other possible ways to do this, and would like to see if anybody can suggest an alternative method for measuring the correlation between the range and profit.
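For reference, the ranking idea described above could look roughly like the sketch below. It assumes the three range labels are exactly those in the sample table, maps them to ordinal codes 1 to 3 (the `salary_rank` column name is made up for illustration), and uses Spearman's rank correlation from scipy rather than Pearson, since the codes only carry an order and not a distance.

```python
import pandas as pd
from scipy import stats

# Sample rows from the table above
df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# Assumed ordering of the three ranges, lowest to highest
order = ["0 - 35,000", "35,000 - 49,999", "50,000 - 100,000"]
ranges = pd.Categorical(df["salary_range"], categories=order, ordered=True)

# Hypothetical helper column: ordinal codes 1..3 instead of 0..2
df["salary_rank"] = ranges.codes + 1

# Spearman works on ranks, so ties in salary_rank are handled by average ranks
rho, p_value = stats.spearmanr(df["salary_rank"], df["profit_amount"])
print("Spearman rho:", round(rho, 2), "p-value:", round(p_value, 2))
```

With only five rows the result is illustrative at best; on the full data the same call applies unchanged.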
Upvotes: 0
Views: 3372
Reputation: 464
I believe the correct way to measure the association between salary_range and profit_amount would be a one-way ANOVA.
import numpy as np
import pandas as pd
from scipy import stats

# Sample data from the question
data = {"user_id": [123, 654, 129, 654, 398],
        "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                         "0 - 35,000", "35,000 - 49,999"],
        "profit_amount": [324, 2083, 20023, 699, 298]}
df = pd.DataFrame(data)

# One-way ANOVA: compare mean profit_amount across the three salary ranges
F, p = stats.f_oneway(
    df[df.salary_range == "0 - 35,000"].profit_amount,
    df[df.salary_range == "35,000 - 49,999"].profit_amount,
    df[df.salary_range == "50,000 - 100,000"].profit_amount)
print("F statistic:", np.round(F, 2))
print("p-value:", np.round(p, 2))
Output:
F statistic: 0.84
p-value: 0.54
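A small F statistic (around 1 or below) together with a large p-value means there is no evidence that the mean of the continuous column differs between the categories. Here p = 0.54 is well above the usual 0.05 threshold, so on this sample there is no evidence of an association between salary range and profit.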
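If the range labels should not be hard-coded, the same test can be run by splitting the column with groupby. This is only a sketch on the five sample rows; the variable names are illustrative.

```python
import pandas as pd
from scipy import stats

# Same sample data as in the answer above
df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# One array of profits per salary range, whatever labels are present
groups = [g.to_numpy() for _, g in df.groupby("salary_range")["profit_amount"]]

# f_oneway accepts any number of samples as positional arguments
F, p = stats.f_oneway(*groups)
print("F statistic:", round(F, 2))
print("p-value:", round(p, 2))
```

This gives the same F and p as the explicit version, and keeps working if more salary ranges are added later.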
Upvotes: 1
Reputation: 51
To measure the association between a quantitative variable and a qualitative variable, you can calculate eta (the correlation ratio).
If it helps, in R you can use the function etaSquared() on an ANOVA.
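In Python, eta squared can be computed by hand as the between-group sum of squares divided by the total sum of squares. The sketch below does this with pandas on the sample rows from the question; it is a manual calculation, not a port of the R function.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

profit = df["profit_amount"]
grand_mean = profit.mean()

# Mean profit of each row's own salary range, aligned row by row
group_means = df.groupby("salary_range")["profit_amount"].transform("mean")

# Between-group and total sums of squares
ss_between = ((group_means - grand_mean) ** 2).sum()
ss_total = ((profit - grand_mean) ** 2).sum()

# Share of the variance in profit explained by salary range (0 to 1)
eta_squared = ss_between / ss_total
print("eta squared:", round(eta_squared, 3))
```

Eta squared lies between 0 (the ranges explain none of the variance in profit) and 1 (they explain all of it); its square root is eta.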
Upvotes: 0