scipy stats correlation to data frame with column names

Question

When scipy.stats.spearmanr(df1,df2) is used on two data frames you can unstack the output values to get 2 numpy arrays of correlation and p-values. How do you put them back into a data frame with the same column names as the original data frames?

correlation, p_value = scipy.stats.spearmanr(df1, df2)

yields 2 numpy arrays but would like to get data frame with correlation and p_value next to each column for x columns as in:

not obvious...here is some code to get random data (suggested by ouroboros1)

import scipy.stats
import numpy as np
import pandas as pd
# Create array of 5 rows and 3 columns,
# filled with random values from 1 to 10 
data = np.random.randint(1,10,size=(5,3))
# Create Dataframe with data
df1 = pd.DataFrame(data, columns=['Col_1','Col_2','Col_3'])
# Set target column in new data frame
df2 = df1['Col_3'].to_frame().copy()
df1.drop(['Col_3'], axis=1, inplace=True)
# Obtain correlation coefficient for each 
# value in column   against value in every other column
correlation, p_value = scipy.stats.spearmanr(df1, df2)

(edited) I think I found a way but there must be a shorter path:

# Concat frames that contain features with frame that
# contains target
correlation_frame = pd.concat([df1, df2], axis=1)
col_cor_list = list(correlation_frame)
# Create new frame containing all correlation coefficients
cor = pd.DataFrame(correlation, columns=col_list)
cor
    Col_1     Col_2        Col_3
0   1.000000    -0.883883   -0.229416
1   -0.883883   1.000000    0.081111
2   -0.229416   0.081111    1.000000

# Get a dataframe for p-values
col_p_list = ['p_value' + str(x) for x in
              range(1,len(p_value)+1)]
p_frame = pd.DataFrame(p_value, columns=col_p_list)
p_frame

    p_value1        p_value2    p_value3
0   1.404265e-24    0.046662    0.710482
1   4.666188e-02    0.000000    0.896840
2   7.104817e-01    0.896840    0.000000

# Combine values by alternating column names
# so p-values are placed correctly
alter_names = (list(sum(zip(col_cor_list, col_p_list), ())))
final = cor.join(p_frame)
results = final[alter_names]
results

   Col_1            p_value1       Col_2    p_value2      Col_3 p_value3
0   1.000000    1.404265e-24    -0.883883   0.046662 -0.229416  0.710482
1   -0.883883   4.666188e-02    1.000000    0.000000    0.081111    0.896840
2   -0.229416   7.104817e-01    0.081111    0.896840    1.000000    0.000000

ouroboros1 · Accepted Answer

Suggested refactored version:

Get correlation and p_value

import scipy.stats
import numpy as np
import pandas as pd

n = 3
data = np.random.randint(1,10,size=(5,n))
# create cols `Col_1`, `_2` etc.
cols = [f'Col_{i+1}' for i in range(n)]
df1 = pd.DataFrame(data, columns=cols)

# create `df2` based on index, and drop
df2 = df1.iloc[:,-1].to_frame().copy()
df1 = df1.iloc[:,:-1]
correlation, p_value = scipy.stats.spearmanr(df1, df2)

Turn into df

Use np.hstack to stack correlation and p_value in sequence horizontally.
Use sorted to sort the columns. Key consists of consecutive numbers at end of each col name, which we extract by using rsplit.

res = pd.DataFrame(np.hstack((correlation,p_value)), columns=cols
                   +[f'p_value_{i+1}' for i in range(n)], index=cols)

res = res.loc[:,sorted(res.columns, 
                       key=lambda x: int(x.rsplit('_', maxsplit=1)[1]))]

print(res)

          Col_1  p_value_1     Col_2     p_value_2     Col_3  p_value_3
Col_1  1.000000   0.000000  0.410391  4.925358e-01  0.648886   0.236154
Col_2  0.410391   0.492536  1.000000  1.404265e-24 -0.263523   0.668397
Col_3  0.648886   0.236154 -0.263523  6.683968e-01  1.000000   0.000000

So, the only variable here is n. E.g. for n = 5, it will produce:

          Col_1     p_value_1     Col_2  ...  p_value_4     Col_5  p_value_5
Col_1  1.000000  1.404265e-24 -0.461690  ...   0.552815  0.307794   0.614384
Col_2 -0.461690  4.337662e-01  1.000000  ...   0.334035 -0.763158   0.133339
Col_3  0.666886  2.188940e-01 -0.263158  ...   0.362245  0.500000   0.391002
Col_4  0.359092  5.528147e-01  0.552632  ...   0.000000 -0.157895   0.799801
Col_5  0.307794  6.143840e-01 -0.763158  ...   0.799801  1.000000   0.000000

scipy stats correlation to data frame with column names

Answers (1)

Related Questions