Reputation: 31
I have a data set with one continuous and two categorical variables, which I use to run a two-way ANOVA, computing the sums of squares and mean squares from scratch. To check my code's reliability I compared the results to the Python statsmodels package, but there were discrepancies that I could not understand or explain.
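(My real data isn't shareable, but a toy unbalanced dataset with the same layout, using the hypothetical level names G0, G1, ... and H0, H1, ... that my code assumes, can be generated like this:)

import numpy as np
import pandas as pd

# Toy stand-in for my data: an unbalanced two-way layout with level
# names G0..G2 and H0..H1, which the code below expects.
rng = np.random.default_rng(0)
rows = []
for i in range(3):                        # levels of category1
    for j in range(2):                    # levels of category2
        n_cell = int(rng.integers(3, 8))  # unequal cell sizes -> unbalanced
        for y in rng.normal(i + j, 1.0, n_cell):
            rows.append({'category1': f'G{i}', 'category2': f'H{j}', 'feature': y})
df = pd.DataFrame(rows)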
Below is my code for computing the sums of squares:
import pandas as pd

p, q = len(pd.unique(df.category1)), len(pd.unique(df.category2))        # levels per factor
n1, n2 = df.groupby('category1').size(), df.groupby('category2').size()  # group sizes per factor
y_groupmean1, y_groupmean2 = df.groupby('category1')['feature'].mean(), df.groupby('category2')['feature'].mean()
y_totalmean = df['feature'].mean()
y_group12 = df.groupby(['category1', 'category2'])['feature']
# Between-group sums of squares for each factor
ss_b1 = sum(n1 * (y_groupmean1 - y_totalmean) ** 2)
ss_b2 = sum(n2 * (y_groupmean2 - y_totalmean) ** 2)
# Within-cell (error) sum of squares, accumulated over all p*q cells
ss_e = 0.0
for i in range(p):
    for j in range(q):
        cell = y_group12.get_group((f'G{i}', f'H{j}'))
        ss_e += sum((cell - cell.mean()) ** 2)
# Total sum of squares; the interaction is taken as the remainder
ss_t = sum((df['feature'] - y_totalmean) ** 2)
ss_i = ss_t - ss_b1 - ss_b2 - ss_e
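To lay these out as an ANOVA table, I use the standard two-way ANOVA degrees of freedom (p-1 and q-1 for the main effects, (p-1)(q-1) for the interaction, and N-pq for the error, with N the total sample size); roughly like this sketch:

# Assemble the manual results into an ANOVA table.
N = len(df)
anova = pd.DataFrame({
    'sum_sq': [ss_b1, ss_b2, ss_i, ss_e],
    'df': [p - 1, q - 1, (p - 1) * (q - 1), N - p * q],
}, index=['category1', 'category2', 'interaction', 'residual'])
anova['mean_sq'] = anova['sum_sq'] / anova['df']
anova['F'] = anova['mean_sq'] / anova.loc['residual', 'mean_sq']
anova.loc['residual', 'F'] = float('nan')  # no F test for the residual row
print(anova)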
Here is the ANOVA table my code produces: my ANOVA (linked image; I don't have enough reputation to embed pictures yet).
Then I checked my results against the Python statsmodels package using the code below:
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('feature ~ C(category1) + C(category2) + C(category1):C(category2)', data=df).fit()
sm.stats.anova_lm(model, type=2)
Here is the ANOVA table statsmodels returns: statsmodels ANOVA (linked image again).
As you can see, statsmodels matches my results for the first category and for the residual sum of squares, but there is a small yet clear discrepancy in the sum of squares for the second categorical variable, which in turn yields a different sum of squares for the interaction.
I find it really strange that statsmodels disagrees with my computation for only one of the two factors, since I use essentially the same formula to compute the sum of squares for both. Can anyone help me figure out why these discrepancies occur?
Thank you!
I've also checked out this Kaggle code, and there statsmodels returns different results for the residual and interaction terms as well. I'm so confused... please help me clear this up!
Upvotes: 2
Views: 526
Reputation: 31
Figured it out.
The problem comes from the fact that the design is unbalanced. In that case the order of the factor variables in the ols formula matters: Y ~ a + b is not the same as Y ~ b + a. The reason is that anova_lm was computing sequential (Type I) sums of squares, which depend on term order whenever the cell sizes are unequal. (Note that anova_lm's keyword argument is typ, not type, so the call above almost certainly fell back to the Type I default rather than using Type II.)
I ran additional tests with the order of the factor variables switched in the ols formula, and they show that only the factor appearing first in the formula, plus the residual, matches my manual sums of squares.
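A minimal sketch of that check, assuming the same df as in the question:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# With unbalanced cells, the sequential (Type I) sums of squares that
# anova_lm computes by default depend on the order of the main effects.
m_ab = ols('feature ~ C(category1) + C(category2) + C(category1):C(category2)', data=df).fit()
m_ba = ols('feature ~ C(category2) + C(category1) + C(category1):C(category2)', data=df).fit()
print(sm.stats.anova_lm(m_ab))  # the C(category1) row matches my ss_b1
print(sm.stats.anova_lm(m_ba))  # the C(category2) row matches my ss_b2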
Upvotes: 1