chessosapiens

Reputation: 3409

Calculate the correlation between two columns in pandas for each pair

I have the following dataframe:

 affecting VA  affected VB
  A        3   B       -1
  A        4   B        1
  B        10  A         2
  B        7   A        -2

For each pair of affecting and affected, I want to calculate the correlation coefficient and p-value (using scipy.stats.pearsonr) between VA and VB.

The resulting dataframe would look something like:

 affecting  affected correlation_coefficient  P_value
   A           B           ...                  ...
   B           A           ...                  ...

 def calc_coefficient ( a, b):
     cor_coef = scipy.stats.pearsonr(a, b)[0]
     Pvalue = scipy.stats.pearsonr(a, b)[1]
     return pd.Series(dict(correlation_coef=cor_coef, P_value=Pvalue))

 uplifts.groupby(["affecting_product","affected_product"])[["UpliftA","UpliftB"]].apply(calc_coefficient)

This solution does not work; it raises: calc_coefficient() missing 1 required positional argument: 'b'

Upvotes: 0

Views: 739

Answers (1)

jezrael

Reputation: 862406

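groupby(...).apply calls the function with a single argument, the sub-DataFrame for each group, which is why the two-argument version fails. Accept that one argument as x, and use the Series x['VA'] and x['VB'] in place of a and b:

import pandas as pd
import scipy.stats

def calc_coefficient(x):
    # pearsonr returns (correlation coefficient, p-value), in that order
    cor_coef, Pvalue = scipy.stats.pearsonr(x['VA'], x['VB'])
    return pd.Series(dict(correlation_coef=cor_coef, P_value=Pvalue))

# keep only pairs that occur more than once; pearsonr needs at least 2 points
mask = df.duplicated(subset=["affecting","affected"], keep=False)
df1 = df[mask].groupby(["affecting","affected"]).apply(calc_coefficient)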

print (df1)
                    correlation_coef  P_value
affecting affected                           
A         B                      1.0      1.0
B         A                      1.0      1.0
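
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample dataframe from the question, so the name df and its contents are copied from that table. Two details are my own choices rather than part of the answer above: the value columns are selected explicitly before apply (which should avoid the grouping-column deprecation warning in recent pandas versions), and reset_index() is called at the end to get flat affecting/affected columns like the layout asked for in the question.

```python
import pandas as pd
from scipy.stats import pearsonr

# Sample data copied from the question's table
df = pd.DataFrame({
    "affecting": ["A", "A", "B", "B"],
    "VA": [3, 4, 10, 7],
    "affected": ["B", "B", "A", "A"],
    "VB": [-1, 1, 2, -2],
})

def calc_coefficient(group):
    # pearsonr returns the correlation coefficient first, then the p-value
    cor_coef, p_value = pearsonr(group["VA"].to_numpy(), group["VB"].to_numpy())
    return pd.Series({"correlation_coef": cor_coef, "P_value": p_value})

# Drop pairs that occur only once; pearsonr needs at least two observations
mask = df.duplicated(subset=["affecting", "affected"], keep=False)

result = (
    df[mask]
    .groupby(["affecting", "affected"])[["VA", "VB"]]  # pass only the value columns
    .apply(calc_coefficient)
    .reset_index()  # turn the group keys back into ordinary columns
)
print(result)
```

Note that with only two rows per pair, as in the sample data, the coefficient is always +1 or -1 and the p-value carries no information, so real data needs more observations per pair for the result to be meaningful.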

Upvotes: 1
