Reputation: 11
I know it's probably obvious how to solve it, but I am out of ideas...
I import a .csv file with Pandas into a DataFrame. The data has 3 columns, each with a single header: the 1st column has 45 rows, the 2nd column 40 rows, and the 3rd column 21 rows. The shape of the DataFrame is then (45, 3). The "missing" rows are filled with NaNs, and here my problem starts.
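For illustration, here is a minimal sketch of what such a DataFrame looks like after the import (the column names and random data are made up, just standing in for the real .csv):

import numpy as np
import pandas as pd

# stand-in for the imported .csv: the shorter columns are padded
# with NaN so every column ends up with the length of the longest one
df = pd.DataFrame({
    "A": np.random.randn(45),
    "B": np.concatenate([np.random.randn(40), np.full(5, np.nan)]),
    "C": np.concatenate([np.random.randn(21), np.full(24, np.nan)]),
})
print(df.shape)         # (45, 3)
print(df.isna().sum())  # NaNs per column: A 0, B 5, C 24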
I want to evaluate some statistics on the data with different SciPy functions, like the Anderson-Darling test, e.g.:
from scipy import stats

for i in df.columns:
    print([i])
    a = stats.anderson(df[i], dist='norm')
    print(a)
    # a[0] is the test statistic, a[1] the critical values, a[2] the significance levels
    if a[0] > a[1][2]:
        print('The null hypothesis can be rejected at', a[2][2], '% significance level')
    else:
        print('The null hypothesis cannot be rejected')
So, the first column gets evaluated just fine:
['Z79V0001']
AndersonResult(statistic=0.41768739435435975, critical_values=array([0.535, 0.609, 0.731, 0.853, 1.014]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
The null hypothesis cannot be rejected
but for the others I get something like
['Z79V0003_1']
AndersonResult(statistic=nan, critical_values=array([0.535, 0.609, 0.731, 0.853, 1.014]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
The null hypothesis cannot be rejected

Filling the NaN values with zeros does not help, because then the statistics are calculated the wrong way. I simply cannot figure out how to adjust the lengths of the columns so that the function works only on the rows where it finds numbers and, once it reaches NaN, moves on to the next column. Help would be very much appreciated.
Upvotes: 1
Views: 101
Reputation: 68186
This will be easiest if you pass numpy arrays to the stats function. You can use Series methods of each column to drop the NaNs:
from scipy import stats

for col in df.columns:
    # drop the NaNs from this column and pass the underlying numpy array
    a = stats.anderson(df[col].dropna().values, dist='norm')
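Putting that together with the hypothesis-test logic from the question, a minimal self-contained sketch could look like this (a made-up two-column DataFrame stands in for the real data; index 2 of the result arrays corresponds to the 5% level):

import numpy as np
import pandas as pd
from scipy import stats

# stand-in for the NaN-padded DataFrame from the question
df = pd.DataFrame({
    "A": np.random.randn(45),
    "B": np.concatenate([np.random.randn(40), np.full(5, np.nan)]),
})

for col in df.columns:
    data = df[col].dropna().values   # only the rows that actually hold numbers
    a = stats.anderson(data, dist='norm')
    print([col], a)
    # index 2 corresponds to the 5 % significance level
    if a.statistic > a.critical_values[2]:
        print('The null hypothesis can be rejected at',
              a.significance_level[2], '% significance level')
    else:
        print('The null hypothesis cannot be rejected')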
Upvotes: 1