Reputation: 31
I have a data set with one continuous and two categorical variables, which I use to run a two-way ANOVA, computing the sums of squares and mean squares from scratch. To check my code's reliability I compared the results to the Python statsmodels package, but there were discrepancies that I could not understand or explain.
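(My real data isn't shareable, but a toy unbalanced dataset with the same layout, using the hypothetical level names G0, G1, ... and H0, H1, ... that my code assumes, can be generated like this:)

import numpy as np
import pandas as pd

# Toy stand-in for my data: an unbalanced two-way layout with level
# names G0..G2 and H0..H1, which the code below expects.
rng = np.random.default_rng(0)
rows = []
for i in range(3):                        # levels of category1
    for j in range(2):                    # levels of category2
        n_cell = int(rng.integers(3, 8))  # unequal cell sizes -> unbalanced
        for y in rng.normal(i + j, 1.0, n_cell):
            rows.append({'category1': f'G{i}', 'category2': f'H{j}', 'feature': y})
df = pd.DataFrame(rows)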
Below is my code for computing the sums of squares:
import pandas as pd

p, q = len(pd.unique(df.category1)), len(pd.unique(df.category2))        # levels per factor
n1, n2 = df.groupby('category1').size(), df.groupby('category2').size()  # group sizes per factor
y_groupmean1, y_groupmean2 = df.groupby('category1')['feature'].mean(), df.groupby('category2')['feature'].mean()
y_totalmean = df['feature'].mean()
y_group12 = df.groupby(['category1', 'category2'])['feature']
# Between-group sums of squares for each factor
ss_b1 = sum(n1 * (y_groupmean1 - y_totalmean) ** 2)
ss_b2 = sum(n2 * (y_groupmean2 - y_totalmean) ** 2)
# Within-cell (error) sum of squares, accumulated over all p*q cells
ss_e = 0.0
for i in range(p):
    for j in range(q):
        cell = y_group12.get_group((f'G{i}', f'H{j}'))
        ss_e += sum((cell - cell.mean()) ** 2)
# Total sum of squares; the interaction is taken as the remainder
ss_t = sum((df['feature'] - y_totalmean) ** 2)
ss_i = ss_t - ss_b1 - ss_b2 - ss_e
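To lay these out as an ANOVA table, I use the standard two-way ANOVA degrees of freedom (p-1 and q-1 for the main effects, (p-1)(q-1) for the interaction, and N-pq for the error, with N the total sample size); roughly like this sketch:

# Assemble the manual results into an ANOVA table.
N = len(df)
anova = pd.DataFrame({
    'sum_sq': [ss_b1, ss_b2, ss_i, ss_e],
    'df': [p - 1, q - 1, (p - 1) * (q - 1), N - p * q],
}, index=['category1', 'category2', 'interaction', 'residual'])
anova['mean_sq'] = anova['sum_sq'] / anova['df']
anova['F'] = anova['mean_sq'] / anova.loc['residual', 'mean_sq']
anova.loc['residual', 'F'] = float('nan')  # no F test for the residual row
print(anova)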
Here is the ANOVA table my code produces: my ANOVA (linked image; I don't have enough reputation to embed pictures yet).
Then I checked my results against the Python statsmodels package using the code below:
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('feature ~ C(category1) + C(category2) + C(category1):C(category2)', data=df).fit()
sm.stats.anova_lm(model, type=2)
Here is the ANOVA table statsmodels returns: statsmodels ANOVA (linked image again).
As you can see, statsmodels matches my results for the first category and for the residual sum of squares, but there is a small yet clear discrepancy in the sum of squares for the second categorical variable, which in turn yields a different sum of squares for the interaction.
I find it really strange that statsmodels disagrees with my computation for only one of the two factors, since I use essentially the same formula to compute the sum of squares for both. Can anyone help me figure out why these discrepancies occur?
Thank you!
I've also checked out this Kaggle code, and there statsmodels returns different results for the residual and interaction terms as well. I'm so confused... please help me clear this up!
Upvotes: 2
Views: 526
Reputation: 31
Figured it out.
The problem comes from the fact that the design is unbalanced. In that case the order of the factor variables in the ols formula matters: Y ~ a + b is not the same as Y ~ b + a. The reason is that anova_lm was computing sequential (Type I) sums of squares, which depend on term order whenever the cell sizes are unequal. (Note that anova_lm's keyword argument is typ, not type, so the call above almost certainly fell back to the Type I default rather than using Type II.)
I ran additional tests with the order of the factor variables switched in the ols formula, and they show that only the factor appearing first in the formula, plus the residual, matches my manual sums of squares.
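A minimal sketch of that check, assuming the same df as in the question:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# With unbalanced cells, the sequential (Type I) sums of squares that
# anova_lm computes by default depend on the order of the main effects.
m_ab = ols('feature ~ C(category1) + C(category2) + C(category1):C(category2)', data=df).fit()
m_ba = ols('feature ~ C(category2) + C(category1) + C(category1):C(category2)', data=df).fit()
print(sm.stats.anova_lm(m_ab))  # the C(category1) row matches my ss_b1
print(sm.stats.anova_lm(m_ba))  # the C(category2) row matches my ss_b2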
Upvotes: 1