chi squared hypothesis test with several binary variables

Question

I have data about plants grown in a nursery. I have a variable for plant health and several binary factors.

I wanted to test if any of the factors influenced plant health, so I thought the best method would be to use a chi squared test.

My method is below, but I get stuck after the cross tab.

import pandas as pd

# Example Data
df = pd.DataFrame({'plant_health': ['a','b','c','a','b','b'],
                   'factor_1': ['yes','no','no','no','yes','yes'],
                   'factor_2': ['yes','yes','no','no','yes','yes'],
                   'factor_3': ['yes','no','no','yes','yes','yes'],
                   'factor_4': ['yes','yes','no','no','yes','yes'],
                   'factor_5': ['yes','no','yes','no','yes','yes'],
                   'factor_6': ['yes','no','no','no','yes','yes'],
                   'factor_7': ['yes','yes','no','yes','yes','yes'],
                   'factor_8': ['yes','no','yes','no','yes','yes'],
                   'factor_9': ['yes','yes','yes','yes','yes','yes'],
                   })

# Melt dataframe
df = df.melt(id_vars='plant_health', 
         value_vars=['factor_1', 'factor_2', 'factor_3', 'factor_4', 'factor_5',
       'factor_6', 'factor_7', 'factor_8', 'factor_9'])

# Create cross tab
pd.crosstab(df.plant_health, columns=[df.variable, df.value])

I can do the test with one factor but don't know how to expand that to all factors.

from scipy.stats import chi2_contingency

# Example with only the first factor
tab_data = [[1,1], [1,2],[1,0]]
chi2_contingency(tab_data)

Khaled DELLAL · Accepted Answer

Try this please and let me know if it's what you expect:

tab = pd.crosstab(df.plant_health, columns=[df.variable, df.value])

chi2_contingency(tab)

Output

(20.666666666666668,
 0.9387023859836788,
 32,
 array([[1.        , 1.        , 0.66666667, 1.33333333, 0.66666667,
         1.33333333, 0.66666667, 1.33333333, 0.66666667, 1.33333333,
         1.        , 1.        , 0.33333333, 1.66666667, 0.66666667,
         1.33333333, 2.        ],
        [1.5       , 1.5       , 1.        , 2.        , 1.        ,
         2.        , 1.        , 2.        , 1.        , 2.        ,
         1.5       , 1.5       , 0.5       , 2.5       , 1.        ,
         2.        , 3.        ],
        [0.5       , 0.5       , 0.33333333, 0.66666667, 0.33333333,
         0.66666667, 0.33333333, 0.66666667, 0.33333333, 0.66666667,
         0.5       , 0.5       , 0.16666667, 0.83333333, 0.33333333,
         0.66666667, 1.        ]]))
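
The tuple printed above holds, in order, the test statistic, the p-value, the degrees of freedom and the array of expected frequencies. As a minimal sketch, the values can be unpacked into names instead of being read by position; the counts here are the single-factor table from the question, and the variable names are only illustrative.

```python
from scipy.stats import chi2_contingency

# Counts for the first factor only, copied from the question
# (rows: plant_health a, b, c; columns: no, yes).
tab_data = [[1, 1], [1, 2], [1, 0]]

# chi2_contingency returns the statistic, the p-value, the degrees
# of freedom and the expected frequencies, in that order.
chi2, p_value, dof, expected = chi2_contingency(tab_data)

print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
print(expected)
```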

EDIT

You can also run an individual chi-squared test for each factor with a function like this:

# Apply this to the original df (before the melt)

def chi_squared_test(plant_health, factor_n):

    tab = pd.crosstab(plant_health, factor_n)

    return chi2_contingency(tab)

chi_squared_test(df.plant_health, df.factor_1)

Output

(1.3333333333333333,
 0.5134171190325922,
 2,
 array([[1. , 1. ],
        [1.5, 1.5],
        [0.5, 0.5]]))
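
To extend the per-factor test to every factor, which is what the question asks for, one option is to loop over the factor columns of the original (un-melted) dataframe and collect the results in a table. This is only a sketch: the helper name `chi_squared_by_factor` and the result column names are made up for illustration, and a factor with a single level (such as `factor_9` here) is reported as NaN because there is nothing to compare.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Example data from the question, before melting
df = pd.DataFrame({'plant_health': ['a','b','c','a','b','b'],
                   'factor_1': ['yes','no','no','no','yes','yes'],
                   'factor_2': ['yes','yes','no','no','yes','yes'],
                   'factor_3': ['yes','no','no','yes','yes','yes'],
                   'factor_4': ['yes','yes','no','no','yes','yes'],
                   'factor_5': ['yes','no','yes','no','yes','yes'],
                   'factor_6': ['yes','no','no','no','yes','yes'],
                   'factor_7': ['yes','yes','no','yes','yes','yes'],
                   'factor_8': ['yes','no','yes','no','yes','yes'],
                   'factor_9': ['yes','yes','yes','yes','yes','yes'],
                   })

def chi_squared_by_factor(data, target='plant_health'):
    """Run one chi-squared test of independence per factor column."""
    rows = []
    for col in data.columns.drop(target):
        tab = pd.crosstab(data[target], data[col])
        if tab.shape[1] < 2:
            # Only one level observed: no association can be tested.
            rows.append({'factor': col, 'chi2': float('nan'),
                         'p_value': float('nan'), 'dof': 0})
            continue
        chi2, p_value, dof, _ = chi2_contingency(tab)
        rows.append({'factor': col, 'chi2': chi2,
                     'p_value': p_value, 'dof': dof})
    return pd.DataFrame(rows).set_index('factor')

results = chi_squared_by_factor(df)
print(results)
```

With only six rows the expected counts are far below the usual rule of thumb of 5 per cell, so the p-values from this example data should not be taken seriously; the loop is meant for the full dataset.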
