Reputation: 71
I am trying to perform a Wilcoxon rank-sum test between two data frames, row by row. For example, the test should compare row 1 of df1 (A, 1, 2, 3) with row 1 of df2 (A, 10, 12, 13), row 2 of df1 (B, 4, 5, 6) with row 2 of df2 (B, 14, 15, 16), and so on.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([['A', 1, 2, 3], ['B', 4, 5, 6], ['C', 7, 8, 9]]),
                   columns=['Details', 'a', 'b', 'c'])
df2 = pd.DataFrame(np.array([['A', 10, 12, 13], ['B', 14, 15, 16], ['C', 17, 18, 19]]),
                   columns=['Details', 'a', 'b', 'c'])
This should give me a column of p-values, one for each pair of matching rows:
out = pd.DataFrame(np.array([['A',0.05], ['B',0.0002], ['C',1]]),
columns=['details','P'])
One way is to use a for loop, but my original dataset has 28000 rows and the experiment has to be repeated at least 1000 times, so a loop is likely to be too slow. Is there a better strategy for this? Thank you very much in advance for your help.
Upvotes: 0
Views: 1303
Reputation: 562
One way to calculate this is with ranksums from scipy.stats. The example below tests the first pair of rows:
from scipy.stats import ranksums
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.array([['A',1, 2, 3], ['B',4, 5, 6], ['C',7, 8, 9]]),
columns=['Details','a', 'b', 'c'])
df2=pd.DataFrame(np.array([['A',10, 12, 13], ['B',14, 15, 16], ['C',17, 18, 19]]),
columns=['Details','a', 'b', 'c'])
a = df1.loc[0, 'a':].values.astype(int)  # First row of df1, numeric columns only
b = df2.loc[0, 'a':].values.astype(int)  # First row of df2, numeric columns only
ranksums(a, b)
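To avoid calling the test once per row, the same statistic can be computed for all rows at once. The following is a minimal sketch, not a definitive implementation. It assumes the rows of df1 and df2 are already aligned on Details, that the value columns are numeric, and that the normal approximation without tie correction (which, as far as I know, is what ranksums uses) is acceptable. The helper name rowwise_ranksums is made up for this example, and the data frames are built from numeric columns directly instead of a mixed-type array.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, rankdata


def rowwise_ranksums(x, y):
    """Two-sided rank-sum z statistic and p-value for each pair of rows.

    Hypothetical helper: normal approximation, no tie correction.
    x and y are 2-D arrays with the same number of rows.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.shape[1], y.shape[1]
    # Rank the pooled sample within every row; ties get average ranks.
    ranks = rankdata(np.hstack([x, y]), axis=1)
    # Sum of the ranks that belong to the first sample, per row.
    s = ranks[:, :n1].sum(axis=1)
    expected = n1 * (n1 + n2 + 1) / 2.0
    z = (s - expected) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return z, 2.0 * norm.sf(np.abs(z))


df1 = pd.DataFrame({"Details": list("ABC"),
                    "a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})
df2 = pd.DataFrame({"Details": list("ABC"),
                    "a": [10, 14, 17], "b": [12, 15, 18], "c": [13, 16, 19]})

cols = ["a", "b", "c"]
z, p = rowwise_ranksums(df1[cols].to_numpy(), df2[cols].to_numpy())
out = pd.DataFrame({"Details": df1["Details"], "P": p})
```

Because everything is done with array operations, the cost for 28000 rows is a single ranking pass over a 28000 x 6 array, which should make repeating the experiment 1000 times practical. With only three values per group the normal approximation is rough, so the p-values should be interpreted with care.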
Upvotes: 2