I have a datamatrix with five columns:
0 1 2 3 4
nan 34 23 34 11
43 34 123 4 44
45 12 4 nan 66
89 78 43 435 23
nan 89 nan 12 687
6 232 34 4 nan
24 56 34 121 56
nan 9 nan 54 12
24 nan 54 12 nan
76 11 123 76 78
43 nan 65 23 89
68 233 34 nan 89
65 53 nan 7 78
34 65 12 8 12
56 98 43 nan 43
I also have a vector, fvector:
fvector
23
67
23
nan
nan
87
323
nan
78
32
78
112
nan
56
nan
56
So far I have only been able to compute the correlation over full columns:
from scipy.stats import spearmanr

for i in datamatrix:
    coef, p = spearmanr(datamatrix[i], fvector)
    print(coef, p, "for column", i)
I want to achieve two things:
1) I want to find Spearman's correlation between fvector and each column of datamatrix, but if one or both values of a pair are NaN, I want to drop that pair from the correlation. For example, the 4th value in column 1 is 78 and the 4th value in fvector is NaN, so I want to exclude that particular pair (not the whole column) from the calculation. I don't know how to work with individual pairs when computing a correlation.
2) If the total number of NaN values in fvector and the datamatrix column is > 30%, exclude the whole column from the correlation.
Any resource or reference will be helpful
Thanks
Upvotes: 1
Views: 420
Reputation: 2424
1) If you pass nan_policy="omit" to spearmanr, the NaN values are ignored in the calculation: a pair is dropped whenever either of its values is NaN. See the scipy.stats.spearmanr documentation.
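To make the pairwise behaviour concrete, here is a minimal sketch that compares nan_policy="omit" with masking out incomplete pairs by hand. The arrays x and y are toy values made up for illustration, not the data from the question.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy values made up for illustration (not the question's data).
x = np.array([1.0, 4.0, np.nan, 2.0, 7.0, 5.0, 9.0])
y = np.array([2.0, np.nan, 3.0, 1.0, 8.0, 4.0, 6.0])

# Keep only the positions where both values are present.
both_present = ~np.isnan(x) & ~np.isnan(y)
coef_manual, p_manual = spearmanr(x[both_present], y[both_present])

# Let SciPy drop the incomplete pairs itself.
coef_omit, p_omit = spearmanr(x, y, nan_policy="omit")

print("pairs kept:", int(both_present.sum()))
print("manual mask:", float(coef_manual))
print("nan_policy='omit':", float(coef_omit))
```

Both calls should report the same coefficient, since in each case only the complete pairs enter the ranking.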
2) You can compute the percentage of NaN values in a column like this: (df[i].isna().sum()*100)/df.shape[0]
All together:
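from scipy import stats

# NaN count in the vector, computed once outside the loop
nan_fvectr = int(vector.isna().sum())
for i in df:
    # skip the column when NaNs make up 30% or more of the column and vector combined
    if ((df[i].isna().sum() + nan_fvectr) * 100) / (df.shape[0] * 2) >= 30:
        continue
    coef, p = stats.spearmanr(df[i], vector, nan_policy="omit")
    print(coef, p, "for column", i)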
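If the results are needed later rather than just printed, the loop above could be wrapped in a small function. This is only a sketch: the name spearman_by_column, the max_nan_pct parameter and the toy DataFrame are hypothetical, and it assumes df and vector have the same number of rows.

```python
import numpy as np
import pandas as pd
from scipy import stats


def spearman_by_column(df, vector, max_nan_pct=30.0):
    """Hypothetical helper: Spearman correlation of each column with vector.

    A column is skipped when the NaN count of the column plus that of the
    vector reaches max_nan_pct percent of all values in the two combined.
    """
    nan_vector = int(vector.isna().sum())
    results = {}
    for name in df:
        nan_pct = (df[name].isna().sum() + nan_vector) * 100 / (df.shape[0] * 2)
        if nan_pct >= max_nan_pct:
            continue
        # nan_policy="omit" drops every pair in which either value is NaN.
        coef, p = stats.spearmanr(df[name], vector, nan_policy="omit")
        results[name] = (float(coef), float(p))
    return results


# Toy data made up for illustration (not the question's data).
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0, 3.0],
})
vector = pd.Series([2.0, 1.0, 3.0, np.nan, 5.0, 6.0])

results = spearman_by_column(df, vector)
for name, (coef, p) in results.items():
    print(coef, p, "for column", name)
```

With this toy input, column "b" is skipped because its combined NaN share is 4 out of 12 values, about 33%, while column "a" (2 out of 12) is kept.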
Upvotes: 1