Reputation: 3409
I have the following dataframe:
affecting  VA  affected  VB
A           3  B         -1
A           4  B          1
B          10  A          2
B           7  A         -2
For each pair of affecting and affected, I want to calculate the correlation coefficient and p-value (using scipy.stats.pearsonr) between VA and VB.
The resulting dataframe should look something like:
affecting  affected  correlation_coefficient  P_value
A          B         ...                      ...
B          A         ...                      ...
def calc_coefficient(a, b):
    cor_coef = scipy.stats.pearsonr(a, b)[0]
    Pvalue = scipy.stats.pearsonr(a, b)[1]
    return pd.Series(dict(correlation_coef=cor_coef, P_value=Pvalue))

uplifts.groupby(["affecting_product","affected_product"])[["UpliftA","UpliftB"]].apply(calc_coefficient)
This solution does not work and raises: calc_coefficient() missing 1 required positional argument: 'b'
Upvotes: 0
Views: 739
Reputation: 862406
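groupby.apply calls the function with a single argument, the DataFrame holding all columns of one group. Accept it as x and use the Series x['VA'] and x['VB'] in place of a and b: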
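import pandas as pd
import scipy.stats

def calc_coefficient(x):
    # pearsonr returns (correlation coefficient, p-value), in that order
    cor_coef, Pvalue = scipy.stats.pearsonr(x['VA'], x['VB'])
    return pd.Series(dict(correlation_coef=cor_coef, P_value=Pvalue))

# pearsonr needs at least 2 rows, so keep only pairs that occur more than once
mask = df.duplicated(subset=["affecting","affected"], keep=False)
df1 = df[mask].groupby(["affecting","affected"]).apply(calc_coefficient)
print(df1)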
                    correlation_coef  P_value
affecting affected
A         B                      1.0      1.0
B         A                      1.0      1.0
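For anyone who wants to run this end to end, here is a minimal self-contained sketch built from the sample data in the question. The `group` and `result` names, the explicit `[["VA", "VB"]]` column selection, and the final `reset_index()` (to turn the group keys back into ordinary columns, as in the desired output) are my own choices for illustration, not something the question requires.

```python
import pandas as pd
from scipy import stats

# Sample data copied from the question
df = pd.DataFrame({
    "affecting": ["A", "A", "B", "B"],
    "VA": [3, 4, 10, 7],
    "affected": ["B", "B", "A", "A"],
    "VB": [-1, 1, 2, -2],
})

def calc_coefficient(group):
    # groupby.apply passes one sub-DataFrame per (affecting, affected) pair
    cor_coef, p_value = stats.pearsonr(group["VA"], group["VB"])
    return pd.Series({"correlation_coef": cor_coef, "P_value": p_value})

# pearsonr needs at least two observations, so drop pairs that occur only once
mask = df.duplicated(subset=["affecting", "affected"], keep=False)

result = (
    df[mask]
    .groupby(["affecting", "affected"])[["VA", "VB"]]
    .apply(calc_coefficient)
    .reset_index()  # group keys become regular columns again
)
print(result)
```

With only two rows per pair, as in this sample, the coefficient can only be +1 or -1, so real data with more rows per pair is needed for meaningful values.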
Upvotes: 1