Stephopolis
Stephopolis

Reputation: 1795

Comparing two pandas dataframes on column and the row

I have two pandas dataframes that look about the same but with different information stored in them. My question will be about how to compare the two dataframes to ensure column and row match before performing some analysis and to obtain a third dataframe of the correlation between the two.

df1 (50x14492):

TYPE GENRE1 GENRE2
Name1 .0945 .0845
Name2 .9074 Nan
Name3 1     0

and df2 (50x14492):

TYPE GENRE1 GENRE2
Name1 .9045 .895
Name2 .074  1
Name3 .5    .045

Hoped for result df3 that is as yet unobtained(50x14492):

TYPE GENRE1                  GENRE2
Name1 spearsonr(.9045,.9045) spearsonr(.0845,.895)
Name2 spearsonr(.9074,.074)  spearsonr(Nan,1)
Name3 spearsonr(1,.5)        spearsonr(0,.045)

I'd like to compare df1.GENRE1.Name1 to df2.GENRE1.Name1 but am getting lost in the implementation. In order to do this I have the following code:

for key1, value1 in df1.iteritems():
    for key2, value2 in df2.iteritems():
        if key1 == key2:
           # this gets me to df1.GENRE1 == df2.GENRE1
           for newkey1, newval1 in key1.iterrows():
               for newkey2, newval2 in key2.iterrows():
                   if newkey1 == newkey2:
                      # this does not seem to get me to df1.GENRE1.Name1 == df2.GENRE1.Name1
                      scipy.stats.spearmanr(newval1, newval2)

This is allowing me to compare df1.GENRE1 and df2.GENRE1 but I am not sure how to get to the next logical step of also ensuring that df1.GENRE1.Name1 == df2.GENRE1.Name1. Another way to put it, I am unsure of how to ensure the rows match now that I have the columns.

NOTE: I have tried to use spearmanr on the full two dataframes as such:

corr, p_val = scipy.stats.spearmanr(df1, df2, axis=0, nan_policy='omit')

but rather than getting a new dataframe of the same size (50x14492) I am getting a table back that's 100x100.

Similarly if I use:

corr, p_val = scipy.stats.spearmanr(df1['GENRE1'], df2['GENRE1'], axis=0, nan_policy='omit')

I get the correlation of the two columns as a whole, rather than each row of that column. (Which would be of size 1X14492)

Upvotes: 0

Views: 114

Answers (1)

Brian
Brian

Reputation: 1604

Your question is a bit convoluted. Are you trying to get the correlation between the two Genre columns?

If so you can simply call the correlation on the two columns in the DataFrame:

scipy.stats.spearmanr(df1['GENRE1'], df2['GENRE1'])

After reading your comment and edits, it appears you want the correlation row-wise. That's a simple CS problem but you should know that you're not going to get anything meaningful out of taking the correlation between two values. It'll just be undefined or 1. Anyway, this should populate df3 as you requested above:

df3 = pd.DataFrame()
df3['genre1'] = map(spearmanr, zip(df1['genre1'], df2['genre1']))
df3['genre2'] = map(spearmanr, zip(df1['genre2'], df2['genre2']))

Upvotes: 1

Related Questions