June Skeeter

Reputation: 1218

Statsmodels GLM and OLS with formulas missing parameters

I am trying to run a generalized linear model using formulas on a data set that contains categorical variables. The results summary table appears to be leaving out one of the variables when I list the parameters.

I haven't been able to find docs specific to the GLM showing the output with categorical variables, but I have for the OLS, and it looks like it should list each level of a categorical variable separately. When I do it (with GLM or OLS) it leaves out one of the values for each category. For example:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd

# root is the data directory, defined earlier in my script
Data = pd.read_csv(root + '/Illisarvik/TestData.csv')
formula = 'Response ~ Day + Class + Var'

# Same model two ways: GLM with a Gaussian family, and plain OLS
gm = sm.GLM.from_formula(formula=formula, data=Data,
                         family=sm.families.Gaussian()).fit()
ls = smf.ols(formula=formula, data=Data).fit()

print(Data)
print(gm.params)
print(ls.params)



   Day Class       Var  Response
0   D     A  0.533088  0.582931
1   D     B  0.839837  0.075011
2   D     C  1.454716  0.505442
3   D     A  1.455503  0.188945
4   D     B  1.163155  0.144176
5   N     A  1.072238  0.918962
6   N     B  0.815384  0.249160
7   N     C  1.182626  0.520460
8   N     A  1.448843  0.870644
9   N     B  0.653531  0.460177

Intercept     0.625111
Day[T.N]      0.298084
Class[T.B]   -0.439025
Class[T.C]   -0.104725
Var          -0.118662
dtype: float64

Intercept     0.625111
Day[T.N]      0.298084
Class[T.B]   -0.439025
Class[T.C]   -0.104725
Var          -0.118662
dtype: float64

Is there something wrong with my model? The same issue presents itself when I print the full summary table:

print(gm.summary())
print(ls.summary())


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:               Response   No. Observations:                   10
Model:                            GLM   Df Residuals:                        5
Model Family:                Gaussian   Df Model:                            4
Link Function:               identity   Scale:                 0.0360609978309
Method:                          IRLS   Log-Likelihood:                 5.8891
Date:                Sun, 05 Mar 2017   Deviance:                      0.18030
Time:                        23:26:48   Pearson chi2:                    0.180
No. Iterations:                     2                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6251      0.280      2.236      0.025       0.077       1.173
Day[T.N]       0.2981      0.121      2.469      0.014       0.061       0.535
Class[T.B]    -0.4390      0.146     -3.005      0.003      -0.725      -0.153
Class[T.C]    -0.1047      0.170     -0.617      0.537      -0.438       0.228
Var           -0.1187      0.222     -0.535      0.593      -0.553       0.316
==============================================================================

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Response   R-squared:                       0.764
Model:                            OLS   Adj. R-squared:                  0.576
Method:                 Least Squares   F-statistic:                     4.055
Date:                Sun, 05 Mar 2017   Prob (F-statistic):             0.0784
Time:                        23:26:48   Log-Likelihood:                 5.8891
No. Observations:                  10   AIC:                            -1.778
Df Residuals:                       5   BIC:                           -0.2652
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6251      0.280      2.236      0.076      -0.094       1.344
Day[T.N]       0.2981      0.121      2.469      0.057      -0.012       0.608
Class[T.B]    -0.4390      0.146     -3.005      0.030      -0.815      -0.064
Class[T.C]    -0.1047      0.170     -0.617      0.564      -0.541       0.332
Var           -0.1187      0.222     -0.535      0.615      -0.689       0.451
==============================================================================
Omnibus:                        1.493   Durbin-Watson:                   2.699
Prob(Omnibus):                  0.474   Jarque-Bera (JB):                1.068
Skew:                          -0.674   Prob(JB):                        0.586
Kurtosis:                       2.136   Cond. No.                         9.75
==============================================================================

Upvotes: 3

Views: 2230

Answers (1)

Bill Bell

Reputation: 21643

This is a consequence of the way the linear model works.

For instance, take the categorical variable Day. As far as the linear model is concerned, it can be represented by a single 'dummy' variable that is set to 0 (zero) for the value you mention first, namely D, and to 1 (one) for the second value, namely N. Statistically speaking, you can recover only the difference between the effects of the two levels of this categorical variable.

If you now consider Class, which has three levels, you get two dummy variables, representing two differences among the three available levels of this categorical variable.
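To see this concretely, you can ask patsy (the formula engine statsmodels uses) for the design matrix it builds. A minimal sketch with a toy frame that just mirrors the Day and Class columns from the question:

import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'Day':   ['D', 'D', 'N', 'N'],
                   'Class': ['A', 'B', 'C', 'A']})

# Treatment (dummy) coding: each categorical contributes k-1 columns;
# the first level (D for Day, A for Class) is absorbed into the intercept.
print(dmatrix('Day + Class', data=df))

The columns are Intercept, Day[T.N], Class[T.B] and Class[T.C]; there is no Day[T.D] or Class[T.A] column because those levels are the baseline that the intercept represents.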

As a matter of fact, it's perfectly possible to expand on this idea using orthogonal polynomials on the treatment means, but that's something for another day.
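(If you're curious, patsy exposes that idea through its built-in Poly contrasts, which you can request directly in the formula. A hedged sketch, reusing the Data frame and smf import from the question, and assuming you are happy to treat the Class levels as ordered:

# Orthogonal polynomial contrasts instead of treatment dummies
ls_poly = smf.ols('Response ~ Day + C(Class, Poly) + Var', data=Data).fit()
print(ls_poly.params)

Class then shows up as C(Class, Poly).Linear and C(Class, Poly).Quadratic columns rather than Class[T.B] and Class[T.C]; it's still k-1 = 2 columns, just a different parameterization.)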

The short answer is that there's nothing wrong, at least on this account, with your model.
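If you'd like a coefficient reported for every level of one of the categoricals, a common trick is to drop the intercept, in which case patsy gives the first categorical term full dummy coding. A sketch, again reusing Data and smf from the question (the fitted model is identical, just reparameterized):

# '- 1' removes the intercept; Day then gets one column per level
ls2 = smf.ols('Response ~ Day + Class + Var - 1', data=Data).fit()
print(ls2.params)

You should now see both Day[D] and Day[N], while Class still gets k-1 dummies, because only one categorical can absorb the intercept's degree of freedom. Alternatively, you can choose which level serves as the baseline with, e.g., C(Class, Treatment(reference='C')) in the formula.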

Upvotes: 1
