Reputation: 1687
How do I do an F-test to check if the variance is equivalent in two vectors in Python?
For example if I have
a = [1,2,1,2,1,2,1,2,1,2]
b = [1,3,-1,2,1,5,-1,6,-1,2]
is there something similar to
scipy.stats.ttest_ind(a, b)
I found
sp.stats.f(a, b)
But it appears to be something different from an F-test.
Upvotes: 48
Views: 98254
Reputation: 24788
The test statistic for the F-test for equal variances is simply:
F = Var(X) / Var(Y)
where F follows an F-distribution with df1 = len(X) - 1 and df2 = len(Y) - 1 degrees of freedom.
scipy.stats.f
which you mentioned in your question has a CDF method. This means you can generate a p-value for the given statistic and test whether that p-value is greater than your chosen alpha level.
Thus:
alpha = 0.05  # or whatever you want your alpha to be
p_value = 1 - scipy.stats.f.cdf(F, df1, df2)
Alternatively, the p-value can be calculated with the sf (survival function) method of scipy.stats.f, which is numerically more accurate in the tail:
p_value = scipy.stats.f.sf(F, df1, df2)
if p_value < alpha:
    # Reject the null hypothesis that Var(X) == Var(Y)
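Putting it together, here is a minimal, self-contained sketch using the sample vectors from the question (the variances use ddof=1 since the means are estimated from the data):
import numpy as np
import scipy.stats

X = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Y = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

F = np.var(X, ddof=1) / np.var(Y, ddof=1)  # ratio of sample variances
df1, df2 = len(X) - 1, len(Y) - 1

# one-sided p-value for H1: Var(X) > Var(Y)
p_value = scipy.stats.f.sf(F, df1, df2)
print(F, p_value)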
Note that the F-test is extremely sensitive to non-normality of X and Y, so unless you're reasonably sure they are normally distributed you're probably better off with a more robust alternative such as Levene's test (Bartlett's test is another option, but like the F-test it assumes normality). These tests can be found in the scipy API:
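For example, scipy.stats.levene and scipy.stats.bartlett both take the samples directly and return the test statistic together with a p-value:
import scipy.stats

X = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Y = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

# Levene's test is robust to departures from normality
stat, p = scipy.stats.levene(X, Y)

# Bartlett's test, like the F-test, assumes normally distributed data
stat, p = scipy.stats.bartlett(X, Y)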
Upvotes: 60
Reputation: 9660
Here is a simple function to calculate the one-sided or two-sided F test with Python and SciPy. The results have been checked against the output of the var.test()
function in R. Please keep in mind the warnings mentioned in the other answers concerning the sensitivity of the F-test to non-normality.
import numpy as np
import scipy.stats as st

def f_test(x, y, alt="two_sided"):
    """
    Calculates the F-test.
    :param x: The first group of data
    :param y: The second group of data
    :param alt: The alternative hypothesis, one of "two_sided" (default), "greater" or "less"
    :return: a tuple with the F statistic value and the p-value.
    """
    df1 = len(x) - 1
    df2 = len(y) - 1
    # sample variances (ddof=1), since the means are estimated from the data
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    if alt == "greater":
        p = st.f.sf(f, df1, df2)
    elif alt == "less":
        p = st.f.cdf(f, df1, df2)
    else:
        # two-sided by default (cf. Crawley, The R Book, p. 355);
        # doubling the smaller tail keeps p <= 1 and matches R's var.test()
        p = 2.0 * min(st.f.cdf(f, df1, df2), st.f.sf(f, df1, df2))
    return f, p
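For instance, with the vectors from the question:
a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

f, p = f_test(a, b)               # two-sided by default
f, p = f_test(a, b, alt="less")   # H1: Var(a) < Var(b)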
Upvotes: 7
Reputation: 179
If you need a two-tailed test, you can proceed as follows; I chose alpha = 0.05:
import numpy as np
from scipy import stats

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]
print('Variance a={0:.3f}, Variance b={1:.3f}'.format(np.var(a, ddof=1), np.var(b, ddof=1)))
fstatistics = np.var(a, ddof=1) / np.var(b, ddof=1)  # ddof=1 because we estimate the means from the data
fdistribution = stats.f(len(a) - 1, len(b) - 1)      # build an F-distribution object
p_value = 2 * min(fdistribution.cdf(fstatistics), 1 - fdistribution.cdf(fstatistics))
f_critical1 = fdistribution.ppf(0.025)
f_critical2 = fdistribution.ppf(0.975)
print(fstatistics, f_critical1, f_critical2)
if p_value < 0.05:
    print('Reject H0', p_value)
else:
    print('Cannot reject H0', p_value)
If you want an ANOVA-like test where only large values of the statistic cause rejection, use a right-tailed test; pay attention to the order of the variances (fstatistics = var1/var2 or var2/var1):
a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]
print('Variance a={0:.3f}, Variance b={1:.3f}'.format(np.var(a, ddof=1), np.var(b, ddof=1)))
fstatistics = max(np.var(a, ddof=1), np.var(b, ddof=1)) / min(np.var(a, ddof=1), np.var(b, ddof=1))  # larger variance on top
fdistribution = stats.f(len(a) - 1, len(b) - 1)  # build an F-distribution object; df1 should match the numerator sample
p_value = 1 - fdistribution.cdf(fstatistics)
f_critical = fdistribution.ppf(0.95)
print(fstatistics, f_critical)
if p_value < 0.05:
    print('Reject H0', p_value)
else:
    print('Cannot reject H0', p_value)
The left-tailed test can be done as follows:
a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]
print('Variance a={0:.3f}, Variance b={1:.3f}'.format(np.var(a, ddof=1), np.var(b, ddof=1)))
fstatistics = min(np.var(a, ddof=1), np.var(b, ddof=1)) / max(np.var(a, ddof=1), np.var(b, ddof=1))  # smaller variance on top
fdistribution = stats.f(len(a) - 1, len(b) - 1)  # build an F-distribution object; df1 should match the numerator sample
p_value = fdistribution.cdf(fstatistics)
f_critical = fdistribution.ppf(0.05)
print(fstatistics, f_critical)
if p_value < 0.05:
    print('Reject H0', p_value)
else:
    print('Cannot reject H0', p_value)
Upvotes: 4
Reputation: 2052
To do a one-way ANOVA you can use
import scipy.stats as stats
stats.f_oneway(a,b)
One-way ANOVA checks whether the variance between the groups is greater than the variance within the groups, and computes the probability of observing this variance ratio under the F-distribution. Note that it tests equality of the group means, not of the variances. A good tutorial can be found here:
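For example, with the vectors from the question:
import scipy.stats as stats

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

# returns the F statistic and its p-value as a named tuple
result = stats.f_oneway(a, b)
print(result.statistic, result.pvalue)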
Upvotes: 10
Reputation: 3357
For anyone who came here searching for an ANOVA F-test, or to compare between models for feature selection:
sklearn.feature_selection.f_classif
does ANOVA F-tests, and
sklearn.feature_selection.f_regression
computes univariate F-tests for regression (one linear regression per feature).
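As a minimal sketch with made-up toy data (the arrays here are purely illustrative), f_classif runs one ANOVA F-test per feature:
import numpy as np
from sklearn.feature_selection import f_classif

# hypothetical toy data: 6 samples, 2 features, binary class labels
X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1],
              [3.0, 0.5], [3.1, 0.4], [2.9, 0.6]])
y = np.array([0, 0, 0, 1, 1, 1])

F, pval = f_classif(X, y)  # one F score and p-value per feature
print(F, pval)
Upvotes: 13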