underclosed

Reputation: 67

One-way Anova loop through pandas dataframe - results in a single table

I have a pandas DataFrame with 16 columns, 14 of which are variables on which I run a one-way ANOVA in a loop using statsmodels. My DataFrame looks something like this (simplified):

ID    Cycle_duration    Average_support_phase    Average_swing_phase    Label
1               23.1                     34.3                   47.2        1
2               27.3                     38.4                   49.5        1
3               25.8                     31.1                   45.7        1
4               24.5                     35.6                   41.9        1
...

So far, this is what I'm doing:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('features_total.csv')

for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)

Which yields:

            sum_sq    df         F    PR(>F)
Label     0.124927   2.0  2.561424  0.084312
Residual  1.731424  71.0       NaN       NaN
              sum_sq    df         F    PR(>F)
Label      62.626057   2.0  4.969491  0.009552
Residual  447.374788  71.0       NaN       NaN
              sum_sq    df         F    PR(>F)
Label      62.626057   2.0  4.969491  0.009552
Residual  447.374788  71.0       NaN       NaN

I get a separate table printed for each variable on which the ANOVA is performed. What I want is a single table with the summarized results, something like this:

                             sum_sq     df         F    PR(>F)
          Cycle_duration   0.1249270   2.0  2.561424  0.084312
                Residual   1.7314240  71.0       NaN       NaN
   Average_support_phase   62.626057   2.0  4.969491  0.009552
                Residual  447.374788  71.0       NaN       NaN
     Average_swing_phase   62.626057   2.0  4.969491  0.009552
                Residual  447.374788  71.0       NaN       NaN

I can already see a problem: this method always names the effect row 'Label' rather than the variable being tested (as shown above, I would like the variable name above each 'Residual'). Is this even possible with the statsmodels approach?

I'm fairly new to Python, so excuse me if this has nothing to do with statsmodels; in that case, please let me know what I should be trying instead.

Upvotes: 3

Views: 3977

Answers (1)

busybear

Reputation: 10590

You can collect the tables inside the loop and concatenate them once it finishes. This creates a hierarchical index, but I think that makes the result a bit clearer. Something like this:

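keys = []
tables = []
# Skip the non-measurement columns (assumed here to be 'ID' and 'Label')
for variable in df.columns.drop(['ID', 'Label']):
    # One-way ANOVA of this variable against the group label
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)

    keys.append(variable)
    tables.append(anova_table)

# The keys become the outer level of a hierarchical (MultiIndex) row index
df_anova = pd.concat(tables, keys=keys, axis=0)
print(df_anova)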
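
If you would rather have the flat layout from the question, with the variable name in place of 'Label', you can relabel the index after concatenating. The following is only a sketch: the two tables are hand-built stand-ins for `anova_lm` results, filled with the numbers printed in the question, and the names `flat` and `summary` are my own.

```python
import numpy as np
import pandas as pd

# Stand-ins for two anova_lm results, using the numbers from the question
cols = ["sum_sq", "df", "F", "PR(>F)"]
tables = [
    pd.DataFrame([[0.124927, 2.0, 2.561424, 0.084312],
                  [1.731424, 71.0, np.nan, np.nan]],
                 index=["Label", "Residual"], columns=cols),
    pd.DataFrame([[62.626057, 2.0, 4.969491, 0.009552],
                  [447.374788, 71.0, np.nan, np.nan]],
                 index=["Label", "Residual"], columns=cols),
]
keys = ["Cycle_duration", "Average_support_phase"]

# Same concatenation as above: rows are indexed by (variable, term)
df_anova = pd.concat(tables, keys=keys, axis=0)

# Flat layout: show the variable name instead of 'Label', keep 'Residual'
flat = df_anova.copy()
flat.index = [var if term == "Label" else term for var, term in df_anova.index]
print(flat)

# Compact layout: one row per variable, effect rows only
summary = df_anova.xs("Label", level=1)
print(summary)
```

The flat version has repeated 'Residual' labels in its index, so it is fine for printing but awkward for lookups; the hierarchical `df_anova` or the one-row-per-variable `summary` is easier to work with afterwards.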

Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding one, but since you are performing numerous statistical tests, it makes sense to account for the increased probability that at least one of them produces a false positive.
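
One possible way to apply such a correction is `multipletests` from statsmodels. This is a rough sketch rather than a recommendation of a particular method: the three p-values are the ones printed in the question (in practice you would pass all 14), and both the Holm method and `alpha=0.05` are my own choices for illustration.

```python
from statsmodels.stats.multitest import multipletests

# PR(>F) values from the question's output, one per tested variable
p_values = [0.084312, 0.009552, 0.009552]

# Holm step-down correction; method and alpha are assumptions for illustration
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p_raw, p_adj, rej in zip(p_values, p_adjusted, reject):
    print("raw p = {:.6f}, adjusted p = {:.6f}, reject H0: {}".format(p_raw, p_adj, rej))
```

With the loop above, the raw p-values can be pulled from the concatenated table with something like `df_anova.xs('Label', level=1)['PR(>F)']`.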

Upvotes: 3
