Borut Flis

Reputation: 16375

Calculating correlations and their statistical significance with scipy.stats and pandas

I am using the Pandas library for some data analysis. I am testing correlations between attributes, so I calculated the correlations using the .corr() function of the Pandas library. I also want to calculate the statistical significance of these correlations. I already asked a question here. The Pandas library does not seem to have this function.

I was advised to use scipy.stats.

from scipy.stats import pearsonr

pearsonr is the function to compute the Pearson correlation, which is exactly what .corr() does, except that it also returns the significance, which is what I am after.

pearsonr cannot deal with NaN/null values, so I get rid of them using .dropna(). This removes more examples than it should.

In my original csv file there are more words for NA/null values; I account for this when I open the file:

data = pd.read_csv(player, sep=',', na_values=['Did Not Dress','Did Not Play','Inactive','Not With Team'], index_col=0)

.corr() deals with the missing values itself. The question is why .dropna() removes too many examples. Some values are 0 or 0.00 (a percentage), but those should not be excluded for my purpose.
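One likely explanation worth checking: called with no arguments, .dropna() drops a row if it has a NaN in any column, not just in the two columns being correlated. A minimal sketch with made-up numbers (the column names here are hypothetical stand-ins for the stat columns above) shows the difference between a full-frame dropna and one restricted to the columns of interest:

```python
import numpy as np
import pandas as pd

# A NaN in an unrelated column ('note') should not cost us rows
# when we only want to correlate 'PTS' with 'FG%'.
data = pd.DataFrame({
    'PTS':  [25, 28, 20, np.nan, 24],
    'FG%':  [.429, .500, .438, .600, np.nan],
    'note': [np.nan, 'x', np.nan, 'y', 'z'],
})

# Drops every row with a NaN anywhere -- only 1 row survives here.
print(len(data.dropna()))

# Drops only rows missing one of the two relevant columns -- 3 rows survive.
print(len(data.dropna(subset=['PTS', 'FG%'])))
```

This would explain why .dropna() removes more examples than .corr(), which computes each pairwise correlation over the rows that are non-null for that pair only.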

A few lines from the .csv file:

Rk,G,Date,Age,Tm,,Opp,,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
1,1,2017-10-18,32-091,SAS,,MIN,W (+8),1,38:49,9,21,.429,1,2,.500,6,7,.857,5,5,10,4,0,2,3,2,25,18.9,+15
2,2,2017-10-21,32-094,SAS,@,CHI,W (+10),1,32:46,12,24,.500,0,2,.000,4,4,1.000,5,5,10,3,1,2,1,2,28,23.7,+13
3,3,2017-10-23,32-096,SAS,,TOR,W (+4),1,36:17,7,16,.438,0,1,.000,6,7,.857,3,5,8,3,1,1,2,3,20,15.4,+10
4,4,2017-10-25,32-098,SAS,@,MIA,W (+17),1,38:09,12,20,.600,1,1,1.000,6,7,.857,1,6,7,1,2,1,2,4,31,23.7,+16
5,5,2017-10-27,32-100,SAS,@,ORL,L (-27),1,29:30,9,14,.643,1,2,.500,5,5,1.000,4,7,11,0,0,1,1,0,24,22.4,-20
6,6,2017-10-29,32-102,SAS,@,IND,L (-3),1,36:24,10,21,.476,1,2,.500,5,7,.714,3,5,8,0,0,1,1,3,26,16.6,-15
7,7,2017-10-30,32-103,SAS,@,BOS,L (-14),1,26:00,5,13,.385,0,2,.000,1,5,.200,3,2,5,2,1,1,1,1,11,6.7,-19
8,8,2017-11-02,32-106,SAS,,GSW,L (-20),1,35:54,8,22,.364,2,4,.500,6,8,.750,5,5,10,2,2,2,2,3,24,17.6,-15

Upvotes: 0

Views: 1086

Answers (1)

user9993950

Reputation: 141

You may want to extract the two columns between which you want to compute Pearson's correlation coefficient and use NumPy's isnan function to remove null values.

import numpy
import scipy.stats

x = data.column_1.values
y = data.column_2.values
# keep only the rows where both values are present
mask = ~numpy.isnan(x) & ~numpy.isnan(y)
x, y = x[mask], y[mask]
rvalue, pvalue = scipy.stats.pearsonr(x, y)

Prior to that you could exclude some rows based on the values they have in other columns.
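If you prefer to stay in pandas rather than drop down to NumPy masks, the same thing can be sketched with dropna restricted to the two columns of interest (column names here are hypothetical):

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up data; 'PTS' and 'FG%' stand in for the two columns of interest.
data = pd.DataFrame({
    'PTS': [25, 28, 20, None, 24],
    'FG%': [.429, .500, .438, .600, None],
})

# Drop only the rows missing either of these two columns.
pair = data[['PTS', 'FG%']].dropna()
rvalue, pvalue = pearsonr(pair['PTS'], pair['FG%'])
```

This avoids the row-count mismatch entirely, because rows with NaN in unrelated columns are never considered.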

Upvotes: 1
