user13846418

Reputation:

Best way to see correlation between a categorical variable and a numerical variable in Python

I have a pandas DataFrame which stores each user's id, their salary range (one of 3 possible ranges), and the profit they generated, as below:

  user_id     salary_range     profit_amount  
 --------- ------------------ --------------- 
      123   0 - 35,000                   324  
      654   50,000 - 100,000            2083  
      129   50,000 - 100,000           20023  
      654   0 - 35,000                   699  
      398   35,000 - 49,999              298  

I would like to see if there is any correlation between a user's salary range and the profit they generate.

Typically I would use seaborn.heatmap along with DataFrame.corr, but this only works for 2 numerical variables, and while salary is typically a numerical amount, here the range is categorical.

Personally, my method of solving this would be to rank the ranges from 1 to 3, and then generate a correlation from there. However, I believe there are other possible ways to do this, and would like to see if anybody can suggest an alternative correlation method between the range and profit.
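
A minimal sketch of the ranking idea described above, assuming the three ranges are ordered from lowest to highest. It uses Spearman's rank correlation, chosen here for illustration because the codes are ordinal rather than evenly spaced; the `order` list and the variable names are assumptions, and the data is only the five sample rows:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# Assumed ordering of the ranges, lowest to highest.
order = ["0 - 35,000", "35,000 - 49,999", "50,000 - 100,000"]

# Map each range to an ordinal code (0, 1, 2) that respects that ordering.
codes = pd.Categorical(df["salary_range"], categories=order, ordered=True).codes

# Spearman's rho depends only on rank order, so the spacing between codes does not matter.
rho, p_value = stats.spearmanr(codes, df["profit_amount"])
print(f"Spearman rho: {rho:.2f}, p-value: {p_value:.2f}")
```

With so few rows the p-value says very little; the sketch is only meant to show the mechanics.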

Upvotes: 0

Views: 3372

Answers (2)

RakeshV

Reputation: 464

I believe the correct way to measure the association between salary_range and profit_amount is a one-way ANOVA.

import numpy as np
import pandas as pd
from scipy import stats

# Sample data from the question
data = {"user_id":[123,654,129,654,398],
        "salary_range":["0 - 35,000","50,000 - 100,000","50,000 - 100,000","0 - 35,000","35,000 - 49,999"],
        "profit_amount":[324,2083,20023,699,298]}

df = pd.DataFrame(data)
df

# One-way ANOVA: compare mean profit_amount across the three salary ranges
F, p = stats.f_oneway(df[df.salary_range=="0 - 35,000"].profit_amount,
                      df[df.salary_range=="35,000 - 49,999"].profit_amount,
                      df[df.salary_range=="50,000 - 100,000"].profit_amount)
print("Statistics Values: ",np.round(F,2), "\n","P _Value        :",np.round(p,2))

Output:

Statistics Values:  0.84                                    
P _Value        : 0.54

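An F statistic near or below 1, together with a large p-value (0.54 here, well above the usual 0.05 threshold), means the test finds no evidence that mean profit differs between the salary ranges. So for this sample there is no detectable association between the categorical column and the continuous column.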

Upvotes: 1

Robin

Reputation: 51

To measure the association between a quantitative variable and a qualitative variable, you need to calculate eta (the correlation ratio).

If it helps, in R you can use the etaSquared() function on an ANOVA model.
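
For a Python equivalent, eta squared in a one-way layout is the between-group sum of squares divided by the total sum of squares. A minimal sketch with pandas, assuming the column names and sample rows from the question; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

grand_mean = df["profit_amount"].mean()
groups = df.groupby("salary_range")["profit_amount"]

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean, summed over groups.
ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()

# Total sum of squares: squared distance of every row from the grand mean.
ss_total = ((df["profit_amount"] - grand_mean) ** 2).sum()

# Eta squared: share of the variance in profit explained by salary range.
eta_squared = ss_between / ss_total
print(f"eta squared: {eta_squared:.3f}")
```

Eta squared lies between 0 and 1, and eta itself is its square root. On five rows the value is not a reliable estimate.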

Upvotes: 0
