user5768866
user5768866

Reputation:

How to perform correlation between categorical columns

I have a set of columns (col1,col2,col3) in dataframe df1 I have another set of columns (col4,col5,col6) in dataframe df2 Assume this two dataframes has the same number of rows.

How do I generate a correlation table that do pairwise correlation between df1 and df2?

the table will look like

    col1 col2 col3
col4 ..   ..   ..
col5 ..   ..   ..
col6 ..   ..   ..

I use df1.corrwith(df2), it does not seem to generate the table as required.

I have a asked a similar question here: How to perform Correlation between two dataframes with different column names but now I am dealing with categorical columns.

If it is not comparable directly, is there a standard way to make it comparable (like using get_dummies)? and is that a faster way to automatically process all fields (assume all are categorical) and calculate their correlation?

Upvotes: 6

Views: 13019

Answers (1)

Ted Petrou
Ted Petrou

Reputation: 62037

You are correct that pd.get_dummies would be needed to get the correlation. Below, I will create some fake data with two categorical columns and then use corrwith

df = pd.DataFrame({'col1':np.random.choice(list('abcde'),100),
                  'col2':np.random.choice(list('xyz'),100)}, dtype='category')
df1 = pd.DataFrame({'col1':np.random.choice(list('abcde'),100),
                   'col2':np.random.choice(list('xyz'),100)}, dtype='category')

dfa = pd.get_dummies(df)
dfb = pd.get_dummies(df1)
dfa.corrwith(dfb)

col1_a   -0.057735
col1_b    0.002513
col1_c    0.137956
col1_d   -0.095050
col1_e   -0.114022
col2_x    0.022568
col2_y   -0.081699
col2_z   -0.128350

Upvotes: 5

Related Questions