Reputation: 1687
How do I do an F-test to check if the variance is equivalent in two vectors in Python?
For example if I have
a = [1,2,1,2,1,2,1,2,1,2]
b = [1,3,-1,2,1,5,-1,6,-1,2]
is there something similar to
scipy.stats.ttest_ind(a, b)
I found
sp.stats.f(a, b)
But it appears to be something different from an F-test.
Upvotes: 48
Views: 98254
Reputation: 24788
The test statistic for the F-test for equal variances is simply:
F = Var(X) / Var(Y)
where F follows an F-distribution with df1 = len(X) - 1 and df2 = len(Y) - 1 degrees of freedom.
scipy.stats.f
which you mentioned in your question has a CDF method. This means you can generate a p-value for the given statistic and test whether that p-value is greater than your chosen alpha level.
Thus:
alpha = 0.05  # or whatever you want your alpha to be
p_value = 1 - scipy.stats.f.cdf(F, df1, df2)
Alternatively, the p-value can be calculated with the sf (survival function) method of scipy.stats.f, which is numerically more accurate in the tail:
p_value = scipy.stats.f.sf(F, df1, df2)
if p_value < alpha:
    # Reject the null hypothesis that Var(X) == Var(Y)
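Putting it together, here is a minimal, self-contained sketch using the sample vectors from the question (the variances use ddof=1 since the means are estimated from the data):
import numpy as np
import scipy.stats

X = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Y = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

F = np.var(X, ddof=1) / np.var(Y, ddof=1)  # ratio of sample variances
df1, df2 = len(X) - 1, len(Y) - 1

# one-sided p-value for H1: Var(X) > Var(Y)
p_value = scipy.stats.f.sf(F, df1, df2)
print(F, p_value)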
Note that the F-test is extremely sensitive to non-normality of X and Y, so unless you're reasonably sure they are normally distributed you're probably better off with a more robust alternative such as Levene's test (Bartlett's test is another option, but like the F-test it assumes normality). These tests can be found in the scipy API:
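For example, scipy.stats.levene and scipy.stats.bartlett both take the samples directly and return the test statistic together with a p-value:
import scipy.stats

X = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Y = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

# Levene's test is robust to departures from normality
stat, p = scipy.stats.levene(X, Y)

# Bartlett's test, like the F-test, assumes normally distributed data
stat, p = scipy.stats.bartlett(X, Y)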
Upvotes: 60
Reputation: 9660
Here is a simple function to calculate the one-sided or two-sided F test with Python and SciPy. The results have been checked against the output of the var.test()
function in R. Please keep in mind the warnings mentioned in the other answers concerning the sensitivity of the F-test to non-normality.
import numpy as np
import scipy.stats as st

def f_test(x, y, alt="two_sided"):
    """
    Calculates the F-test.
    :param x: The first group of data
    :param y: The second group of data
    :param alt: The alternative hypothesis, one of "two_sided" (default), "greater" or "less"
    :return: a tuple with the F statistic value and the p-value.
    """
    df1 = len(x) - 1
    df2 = len(y) - 1
    # sample variances (ddof=1), since the means are estimated from the data
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    if alt == "greater":
        p = st.f.sf(f, df1, df2)
    elif alt == "less":
        p = st.f.cdf(f, df1, df2)
    else:
        # two-sided by default (cf. Crawley, The R Book, p. 355);
        # doubling the smaller tail keeps p <= 1 and matches R's var.test()
        p = 2.0 * min(st.f.cdf(f, df1, df2), st.f.sf(f, df1, df2))
    return f, p
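For instance, with the vectors from the question:
a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

f, p = f_test(a, b)               # two-sided by default
f, p = f_test(a, b, alt="less")   # H1: Var(a) < Var(b)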
Upvotes: 7
Reputation: 179
If you need a two-tailed test, you can proceed as follows; I chose alpha = 0.05:
import numpy as np
from scipy import stats

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]
print('Variance a={0:.3f}, Variance b={1:.3f}'.format(np.var(a, ddof=1), np.var(b, ddof=1)))
fstatistics = np.var(a, ddof=1) / np.var(b, ddof=1)  # ddof=1 because we estimate the means from the data
fdistribution = stats.f(len(a) - 1, len(b) - 1)      # build an F-distribution object
p_value = 2 * min(fdistribution.cdf(fstatistics), 1 - fdistribution.cdf(fstatistics))
f_critical1 = fdistribution.ppf(0.025)
f_critical2 = fdistribution.ppf(0.975)
print(fstatistics, f_critical1, f_critical2)
if p_value < 0.05:
    print('Reject H0', p_value)
else:
    print('Cannot reject H0', p_value)
If you want an ANOVA-like test where only large values of the statistic cause rejection, use a right-tailed test; pay attention to the order of the variances (fstatistics = var1/var2 or var2/var1):
a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]
print('Variance a={0:.3f}, Variance b={1:.3f}'.format(np.var(a, ddof=1), np.var(b, ddof=1)))
fstatistics = max(np.var(a, ddof=1), np.var(b, ddof=1)) / min(np.var(a, ddof=1), np.var(b, ddof=1))  # larger variance on top
fdistribution = stats.f(len(a) - 1, len(b) - 1)  # build an F-distribution object; df1 should match the numerator sample
p_value = 1 - fdistribution.cdf(fstatistics)
f_critical = fdistribution.ppf(0.95)
print(fstatistics, f_critical)
if p_value < 0.05:
    print('Reject H0', p_value)
else:
    print('Cannot reject H0', p_value)
The left-tailed test can be done as follows:
a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]
print('Variance a={0:.3f}, Variance b={1:.3f}'.format(np.var(a, ddof=1), np.var(b, ddof=1)))
fstatistics = min(np.var(a, ddof=1), np.var(b, ddof=1)) / max(np.var(a, ddof=1), np.var(b, ddof=1))  # smaller variance on top
fdistribution = stats.f(len(a) - 1, len(b) - 1)  # build an F-distribution object; df1 should match the numerator sample
p_value = fdistribution.cdf(fstatistics)
f_critical = fdistribution.ppf(0.05)
print(fstatistics, f_critical)
if p_value < 0.05:
    print('Reject H0', p_value)
else:
    print('Cannot reject H0', p_value)
Upvotes: 4
Reputation: 2052
To do a one-way ANOVA you can use
import scipy.stats as stats
stats.f_oneway(a,b)
One-way ANOVA checks whether the variance between the groups is greater than the variance within the groups, and computes the probability of observing this variance ratio under the F-distribution. Note that it tests equality of the group means, not of the variances. A good tutorial can be found here:
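For example, with the vectors from the question:
import scipy.stats as stats

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [1, 3, -1, 2, 1, 5, -1, 6, -1, 2]

# returns the F statistic and its p-value as a named tuple
result = stats.f_oneway(a, b)
print(result.statistic, result.pvalue)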
Upvotes: 10
Reputation: 3357
For anyone who came here searching for an ANOVA F-test, or to compare between models for feature selection:
sklearn.feature_selection.f_classif
does ANOVA F-tests, and
sklearn.feature_selection.f_regression
computes univariate F-tests for regression (one linear regression per feature).
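As a minimal sketch with made-up toy data (the arrays here are purely illustrative), f_classif runs one ANOVA F-test per feature:
import numpy as np
from sklearn.feature_selection import f_classif

# hypothetical toy data: 6 samples, 2 features, binary class labels
X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1],
              [3.0, 0.5], [3.1, 0.4], [2.9, 0.6]])
y = np.array([0, 0, 0, 1, 1, 1])

F, pval = f_classif(X, y)  # one F score and p-value per feature
print(F, pval)
Upvotes: 13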