user13505457

How to find Spearman's correlation in Python for only specific values?

I have a datamatrix with five columns:

  0     1     2     3     4
nan    34    23    34    11
 43    34   123     4    44
 45    12     4   nan    66
 89    78    43   435    23
nan    89   nan    12   687
  6   232    34     4   nan
 24    56    34   121    56
nan     9   nan    54    12
 24   nan    54    12   nan
 76    11   123    76    78
 43   nan    65    23    89
 68   233    34   nan    89
 65    53   nan     7    78
 34    65    12     8    12
 56    98    43   nan    43

I also have a vector, fvector:

fvector
23
67
23
nan
nan
87
323
nan
78
32
78
112
nan
56
nan
56

So far I have only been able to compute the correlation using each full column:

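from scipy.stats import spearmanr

# Correlate every column of datamatrix with fvector (no NaN handling yet)
for i in datamatrix:
    coef, p = spearmanr(datamatrix[i], fvector)
    print(coef, p, "for column", i)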

I want to achieve 2 things:

1) I want to find Spearman's correlation between fvector and each column of datamatrix, but if either value in a pair (or both) is nan, I want to drop that pair from the calculation. For example, the 4th value in column 1 is 78 and the 4th value in fvector is nan, so I want to exclude that particular pair (not the whole column) from the correlation. I don't know how to restrict the correlation to specific values.

2) If the total share of nan values in fvector and the datamatrix column is > 30%, I want to exclude the whole column from the correlation.

Any resource or reference will be helpful

Thanks

Upvotes: 1

Views: 420

Answers (1)

DavideBrex

Reputation: 2424

1) If you set nan_policy="omit", NaN values are ignored in the calculation: any pair in which either value is NaN is dropped. See the scipy.stats.spearmanr documentation.

2) You can compute the percentage of NaN values in each column like this: (df[i].isna().sum()*100)/df.shape[0]
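To make point 1 concrete, here is a minimal sketch comparing nan_policy="omit" with dropping the incomplete pairs by hand using a boolean mask. The arrays x and y are made-up toy data, not the data from the question, and the equal-length NumPy float inputs are an assumption of this sketch.

```python
import numpy as np
from scipy import stats

# Made-up toy data: two equal-length arrays with NaN in different positions.
x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
y = np.array([2.0, np.nan, 3.0, 8.0, 7.0, 12.0])

# Keep only the pairs where both values are present.
mask = ~np.isnan(x) & ~np.isnan(y)
coef_manual, p_manual = stats.spearmanr(x[mask], y[mask])

# Let SciPy drop the incomplete pairs itself.
coef_omit, p_omit = stats.spearmanr(x, y, nan_policy="omit")

print(coef_manual, coef_omit)
```

The two coefficients should agree, because in both cases only the complete pairs enter the ranking.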

All together:

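from scipy import stats

# Number of NaN values in the vector (the same for every column)
nan_fvectr = int(vector.isna().sum())
for i in df:
    # Combined NaN percentage over the column and the vector
    if ((df[i].isna().sum() + nan_fvectr) * 100) / (df.shape[0] * 2) > 30:
        continue  # skip columns above the 30% threshold
    coef, p = stats.spearmanr(df[i], vector, nan_policy="omit")
    print(coef, p, "for column", i)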
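If you would rather collect the results than print them, the loop above can be wrapped in a small function. This is only a sketch: the name spearman_by_column, the max_nan_pct parameter, and the toy data are hypothetical, and it assumes df is a pandas DataFrame and vector a pandas Series with the same number of rows. It uses a strict > comparison to match the "> 30%" wording in the question.

```python
import numpy as np
import pandas as pd
from scipy import stats


def spearman_by_column(df, vector, max_nan_pct=30.0):
    """Return {column: (coef, p)} for columns at or under the NaN threshold."""
    if len(df) != len(vector):
        raise ValueError("df and vector must have the same number of rows")
    results = {}
    nan_vector = int(vector.isna().sum())
    for col in df:
        # Combined NaN share over the column and the vector, as in the loop above.
        nan_pct = (df[col].isna().sum() + nan_vector) * 100 / (len(df) * 2)
        if nan_pct > max_nan_pct:
            continue  # too many missing values: skip the whole column
        coef, p = stats.spearmanr(df[col], vector, nan_policy="omit")
        results[col] = (float(coef), float(p))
    return results


# Made-up toy data, not the data from the question.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 5.0, 6.0],
})
vector = pd.Series([2.0, np.nan, 3.0, 8.0, 7.0, 12.0])

results = spearman_by_column(df, vector)
for col, (coef, p) in results.items():
    print(col, coef, p)
```

In this toy data, column "b" is skipped because its combined NaN share is above 30%. Note that the threshold counts NaN values in the column and the vector together, which is not the same as the share of pairs that actually get dropped. If the latter is what you need, compute (df[col].isna() | vector.isna()).mean() * 100 instead.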

Upvotes: 1
