Reputation:
I have a set of columns (col1,col2,col3) in dataframe df1 I have another set of columns (col4,col5,col6) in dataframe df2 Assume this two dataframes has the same number of rows.
How do I generate a correlation table that do pairwise correlation between df1 and df2?
the table will look like
col1 col2 col3
col4 .. .. ..
col5 .. .. ..
col6 .. .. ..
I use df1.corrwith(df2)
, it does not seem to generate the table as required.
I have a asked a similar question here: How to perform Correlation between two dataframes with different column names but now I am dealing with categorical columns.
If it is not comparable directly, is there a standard way to make it comparable (like using get_dummies)? and is that a faster way to automatically process all fields (assume all are categorical) and calculate their correlation?
Upvotes: 6
Views: 13019
Reputation: 62037
You are correct that pd.get_dummies
would be needed to get the correlation. Below, I will create some fake data with two categorical columns and then use corrwith
df = pd.DataFrame({'col1':np.random.choice(list('abcde'),100),
'col2':np.random.choice(list('xyz'),100)}, dtype='category')
df1 = pd.DataFrame({'col1':np.random.choice(list('abcde'),100),
'col2':np.random.choice(list('xyz'),100)}, dtype='category')
dfa = pd.get_dummies(df)
dfb = pd.get_dummies(df1)
dfa.corrwith(dfb)
col1_a -0.057735
col1_b 0.002513
col1_c 0.137956
col1_d -0.095050
col1_e -0.114022
col2_x 0.022568
col2_y -0.081699
col2_z -0.128350
Upvotes: 5