ishido

Reputation: 4255

How to calculate Spearman's rank correlation between two datasets

If we have:

X = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
Y = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

How do we calculate Spearman's Rank Correlation between the two datasets (but not within each dataset), so that in the end we have a 5x5 matrix? Like this:

    A  B  C  D  E
A   .  .  .  .  .
B   .  .  .  .  .
C   .  .  .  .  .
D   .  .  .  .  .
E   .  .  .  .  .

Upvotes: 1

Views: 4126

Answers (2)

Nils Gudat

Reputation: 13800

Using pandas' concat and corr functions, you can turn this into a one-liner by putting everything into a single DataFrame:

import pandas as pd

X = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
Y = pd.DataFrame({"A1":[45,24,65,65,65], "B1":[45,87,65,52,12], "C1":[98,52,32,32,12], "D1":[0,23,1,365,53], "E1":[24,12,65,3,65]})

pd.concat([X,Y], axis=1).corr(method="spearman").iloc[5:,:5]

Note that in my example I gave the second set of columns a different name to make them more easily distinguishable. Using pandas' indexing features you could come up with a more sophisticated way of picking out the desired rows/columns from the correlation table than my .iloc[5:,:5], but in this case it works.
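One label-based alternative to .iloc[5:,:5], as a minimal sketch: select the block with .loc and the two frames' column indexes, so the slice does not depend on there being exactly five columns. The helper name cross_spearman is made up for this illustration, and it assumes X and Y have distinct column names, as in the renamed example above.

```python
import pandas as pd

X = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
Y = pd.DataFrame({"A1":[45,24,65,65,65], "B1":[45,87,65,52,12], "C1":[98,52,32,32,12], "D1":[0,23,1,365,53], "E1":[24,12,65,3,65]})

def cross_spearman(x, y):
    """Hypothetical helper: Spearman correlations between columns of x and columns of y."""
    # Full correlation table over all columns of both frames
    full = pd.concat([x, y], axis=1).corr(method="spearman")
    # Keep only the cross block: rows are y's columns, columns are x's columns
    # (the same orientation as .iloc[5:, :5])
    return full.loc[y.columns, x.columns]

result = cross_spearman(X, Y)
print(result)
```

Because the selection is by label, the same call should keep working if either frame gains or loses columns, as long as the names stay distinct.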



Upvotes: 3

Zafi

Reputation: 629

This should do the trick! It could probably be made shorter, though:

import pandas as pd
import numpy as np
from scipy.stats import linregress


X = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
Y = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

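# One row per column of X, one column per column of Y
m = np.zeros((X.shape[1], Y.shape[1]))
for row, (key_x, val_x) in enumerate(X.items()):
    for col, (key_y, val_y) in enumerate(Y.items()):
        # linregress reports Pearson's r; computing it on the ranks gives Spearman's rho
        m[row, col] = linregress(val_x.rank(), val_y.rank()).rvalue

# Label the result so it reads like the 5x5 matrix asked for in the question
print(pd.DataFrame(m, index=X.columns, columns=Y.columns))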

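To calculate the correlation, I am using linregress. Its rvalue is the Pearson coefficient, so it yields Spearman's rank correlation only when it is given the ranks of each column (for example val_x.rank()) instead of the raw values. There are other alternatives such as:

  • numpy.corrcoef (also Pearson, so apply it to ranks)
  • pandas.DataFrame.corr (accepts method="spearman")
  • scipy.stats.spearmanr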

And probably some others too ;)
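As a rough sketch of the numpy.corrcoef route mentioned in the list: rank every column with pandas, compute Pearson correlations of the ranks over both frames at once, and keep the X-versus-Y block. The variable names n_x, full and cross are only illustrative.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
Y = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})

n_x = X.shape[1]

# Rank each column (ties receive average ranks), then take Pearson
# correlations of the ranks; rowvar=False treats columns as variables.
full = np.corrcoef(X.rank().to_numpy(), Y.rank().to_numpy(), rowvar=False)

# full covers all columns of X followed by all columns of Y.
# The top-right block holds X's columns (rows) against Y's columns (columns).
cross = pd.DataFrame(full[:n_x, n_x:], index=X.columns, columns=Y.columns)
print(cross)
```

This avoids the explicit double loop, and it works even though X and Y share column names, since the block is picked by position rather than by label.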

Upvotes: 0
