Reputation: 605
I have a dataframe and I am trying to apply a t-test to each row, but it returns nan for every row.
Code:
from scipy.stats import ttest_ind, ttest_rel
import pandas as pd
df_stat = df_stat[['day', 'hour', 'CallerObjectId', 'signals_normalized', 'presence_normalized']]
def ttest(a, b):
    t = ttest_ind(a, b)
    return t
df_stat['ttest'] = df_stat.apply(lambda row: ttest(row['presence_normalized'], row['signals_normalized']), axis=1)
print(df_stat)
Output:
day hour CallerObjectId signals_normalized presence_normalized ttest
0 2021-04-04 9 287b19b7-32ce-4617-94b1-57a632f6f147 0.062500 0.514461 (nan, nan)
1 2021-04-04 16 287b19b7-32ce-4617-94b1-57a632f6f147 0.187500 1.000000 (nan, nan)
2 2021-04-04 17 287b19b7-32ce-4617-94b1-57a632f6f147 0.187500 0.895121 (nan, nan)
3 2021-04-04 18 287b19b7-32ce-4617-94b1-57a632f6f147 0.062500 0.608823 (nan, nan)
4 2021-04-04 19 287b19b7-32ce-4617-94b1-57a632f6f147 1.000000 0.716623 (nan, nan)
5 2021-04-04 20 287b19b7-32ce-4617-94b1-57a632f6f147 0.062500 0.314928 (nan, nan)
Upvotes: 2
Views: 618
Reputation: 4186
A t-test compares two samples, each made up of several observations. You are using it to compare two single values.
Internally, the t statistic divides by a standard error estimated from the sample variances. With only one observation per group, the sample variance is undefined (its n - 1 denominator is zero), so the calculation ends up dividing by zero and returns nan. (See Wikipedia.)
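To see this in isolation, here is a minimal sketch that calls `ttest_ind` once with a single value per group and once with several values per group. The numbers are copied from the output in the question; pooling all six rows into one sample is only an assumption made for illustration, not a recommendation for how the data should be grouped.

```python
import warnings

import numpy as np
from scipy.stats import ttest_ind

# One observation per group, which is what the row-wise apply passes in.
# SciPy emits warnings for such tiny samples, so they are silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    single = ttest_ind([0.514461], [0.062500])

# Several observations per group (the six rows shown in the question).
presence = [0.514461, 1.000000, 0.895121, 0.608823, 0.716623, 0.314928]
signals = [0.062500, 0.187500, 0.187500, 0.062500, 1.000000, 0.062500]
multi = ttest_ind(presence, signals)

print("single-value statistic is nan:", np.isnan(single.statistic))
print("multi-value statistic is nan:", np.isnan(multi.statistic))
```

The first call has no variance to estimate and returns nan, while the second returns a finite statistic and p-value.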
The values you are testing appear to be aggregated per hour. You should probably run the t-test on the raw values, before they are aggregated per hour. Alternatively, you could run one t-test per day, using that day's hourly values as the two samples.
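The per-day variant could look roughly like the sketch below. It assumes the column names from the question and rebuilds a small frame from the six rows shown in its output (the `CallerObjectId` column is left out for brevity). The `per_day` name and the loop are illustrative choices, not the only way to do this.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Sample frame rebuilt from the rows printed in the question.
df_stat = pd.DataFrame({
    "day": ["2021-04-04"] * 6,
    "hour": [9, 16, 17, 18, 19, 20],
    "signals_normalized": [0.0625, 0.1875, 0.1875, 0.0625, 1.0, 0.0625],
    "presence_normalized": [0.514461, 1.0, 0.895121, 0.608823, 0.716623, 0.314928],
})

# Run one t-test per day: each day's hourly values form the two samples.
rows = []
for day, group in df_stat.groupby("day"):
    result = ttest_ind(group["presence_normalized"], group["signals_normalized"])
    rows.append({"day": day, "statistic": result.statistic, "pvalue": result.pvalue})

per_day = pd.DataFrame(rows)
print(per_day)
```

Each day then gets a single statistic and p-value instead of one nan pair per row. A day with only one hourly row would still produce nan, for the same reason as before.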
Upvotes: 1