Reputation: 88
I am using the following code to perform a t-test:
import logging
import numpy as np
from scipy.special import stdtr

logger = logging.getLogger(__name__)

def t_stat(na, abar, avar, nb, bbar, bvar):
    logger.info("T-test to be performed")
    logger.info("Set A count = %f mean = %f variance = %f" % (na, abar, avar))
    logger.info("Set B count = %f mean = %f variance = %f" % (nb, bbar, bvar))
    adof = na - 1
    bdof = nb - 1
    logger.info("Degrees of Freedom of a=%f" % adof)
    logger.info("Degrees of Freedom of b=%f" % bdof)
    # Welch's t statistic
    tf = (abar - bbar) / np.sqrt(avar / na + bvar / nb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (avar / na + bvar / nb)**2 / (avar**2 / (na**2 * adof) + bvar**2 / (nb**2 * bdof))
    logger.info("tf = %f, dof=%f" % (tf, dof))
    # two-sided p-value from the Student's t CDF
    pf = 2 * stdtr(dof, -np.abs(tf))
    return tf, pf
My output looks like:
Set A count = 3547465.000000 mean = 0.001123 variance = 0.000369
Set B count = 83759692.000000 mean = 0.001242 variance = 0.000424
Degrees of Freedom of a=3547464.000000
Degrees of Freedom of b=83759691.000000
tf = -11.374250, dof=-2176568.362223
formula: t = -11.3743 p = nan
When I pass the same data as arrays and use the ttest_ind function, I get t = -11.374250 and p = 0.000000.
Why is my function giving p as nan? As far as I know, I cannot treat nan as 0. How can I pin down the exact difference between my t_stat and ttest_ind? Any help would be appreciated.
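For reference, this is roughly the comparison I am describing; a minimal sketch using synthetic stand-in arrays built from the logged means and variances (the real arrays are far larger), not my actual data:

import numpy as np
from scipy.stats import ttest_ind

# synthetic stand-ins for the two raw arrays
rng = np.random.default_rng(0)
a = rng.normal(0.001123, np.sqrt(0.000369), size=10000)
b = rng.normal(0.001242, np.sqrt(0.000424), size=10000)

# equal_var=False gives Welch's t-test, which is what the formula in t_stat implements
t, p = ttest_ind(a, b, equal_var=False)
print("formula: t = %.4f p = %.6f" % (t, p))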
Upvotes: 1
Views: 1631
Reputation: 3947
The degrees of freedom you are passing to the formula are negative.
In [6]:
import numpy as np
from scipy.special import stdtr
dof = -2176568
tf = -11.374250
2*stdtr(dof, -np.abs(tf))
Out[6]:
nan
If positive:
In [7]:
import numpy as np
from scipy.special import stdtr
dof = 2176568
tf = -11.374250
2*stdtr(dof, -np.abs(tf))
Out[7]:
5.6293517178917971e-30
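In your own function you could also fail fast when the degrees of freedom come out non-positive, instead of letting stdtr return nan silently. A minimal sketch (an illustrative helper, not your exact code):

import numpy as np
from scipy.special import stdtr

def two_sided_p(tf, dof):
    # stdtr returns nan for non-positive degrees of freedom, so raise early instead
    if not np.isfinite(dof) or dof <= 0:
        raise ValueError("invalid degrees of freedom: %r" % dof)
    return 2 * stdtr(dof, -np.abs(tf))

print(two_sided_p(-11.374250, 2176568))  # ~5.63e-30, as in Out[7] above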
I'm not sure how that happened in your case. I've run your code, trying to infer the input parameters from your output:
In [13]:
def t_stat(na, abar, avar, nb, bbar, bvar):
    print("T-test to be performed")
    print("Set A count = %f mean = %f variance = %f" % (na, abar, avar))
    print("Set B count = %f mean = %f variance = %f" % (nb, bbar, bvar))
    adof = na - 1
    bdof = nb - 1
    print("Degrees of Freedom of a=%f" % adof)
    print("Degrees of Freedom of b=%f" % bdof)
    tf = (abar - bbar) / np.sqrt(avar / na + bvar / nb)
    dof = (avar / na + bvar / nb)**2 / (avar**2 / (na**2 * adof) + bvar**2 / (nb**2 * bdof))
    print("tf = %f, dof=%f" % (tf, dof))
    print(stdtr(dof, -np.abs(tf)))
In [14]:
t_stat(3547465, 0.001123, 0.000369, 83759692, 0.001242, 0.000424)
T-test to be performed
Set A count = 3547465.000000 mean = 0.001123 variance = 0.000369
Set B count = 83759692.000000 mean = 0.001242 variance = 0.000424
Degrees of Freedom of a=3547464.000000
Degrees of Freedom of b=83759691.000000
tf = -11.393950, dof=3900753.641275
2.2434573594e-30
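One possibility, though it is just a guess since I can't see how you compute na and nb: if the counts arrive as fixed-width NumPy integers rather than plain Python ints, the na**2*adof and nb**2*bdof terms overflow int64 and wrap around silently, and the resulting garbage can make the whole dof expression come out negative. A quick illustration of the overflow itself:

import numpy as np

nb, bdof = np.int64(83759692), np.int64(83759691)

# the exact value is about 5.9e23, far beyond the int64 range (~9.2e18),
# so NumPy wraps around instead of raising (at most you get a RuntimeWarning)
print(nb**2 * bdof)              # wrapped-around garbage
print(83759692**2 * 83759691)    # plain Python ints are arbitrary precision

If that is what's going on, converting na and nb to plain ints or floats before computing dof should fix it.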
Hope this helps you find the problem anyway.
Upvotes: 1