Reputation: 67
I have a pandas dataframe containing 16 columns, of which 14 represent variables where i perform a looped Anova test using statsmodels
. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what i'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('features_total.csv')
for variable in df.columns:
model = ols('{} ~ Label'.format(variable), data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting an individual table print for each variable where the Anova is performed. Basically what i want is to print one single table with the summarized results, or something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem because this method always outputs the 'Label' nomenclature before the actual values, and not the variable name in question (like i've shown above, i would like to have the variable name above each 'residual'). Is this even possible with the statsmodels
approach?
I'm fairly new to python and excuse me if this has nothing to do with statsmodels - in that case, please do elucidate me on what i should be trying.
Upvotes: 3
Views: 3977
Reputation: 10590
You can collect the tables and concatenate them at the end of your loop. This method will create a hierarchical index, but I think that makes it a bit more clear. Something like this:
keys = []
tables = []
for variable in df.columns:
model = ols('{} ~ Label'.format(variable), data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
keys.append(variable)
tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding suggestion, but considering you are performing numerous statistical tests, it would make sense to account for the probability that one of the test would result in a false positive.
Upvotes: 3