Python statsmodels, glm formula and categorical variables

Question

Suppose I have a dataframe df with the columns O, A, B, D, X, Y, Z. exo is a list of the names of the predictor variables, which are A, B, D, X, Y, Z. The first four columns (O, A, B, D) are real-valued and the last three (X, Y, Z) are categorical, such that

For any x in X, x is equal to exactly one of the following elements in the list ["RED", "ORANGE", "YELLOW", "GREEN", "BLUE", "INDIGO", "VIOLET"].
For any y in Y, y is equal to exactly one of the following elements in the list ["DO", "RE", "MI", "FA", "SO", "LA", "TI"].
For any z in Z, z is equal to exactly one of the following elements in the list ["LUST", "GLUTTONY", "GREED", "SLOTH", "WRATH", "ENVY", "PRIDE"].

So I sample one hundred elements of df and split the sampled set into train and test sets. Then I decide to write

import statsmodels.api as sm
import statsmodels.formula.api as smf

mod = smf.glm(formula="O ~ A + B + D + C(X) + C(Y) + C(Z)",
              data=train,
              family=sm.families.Tweedie(var_power=1.5))
mod = mod.fit()
result = mod.predict(exog=test[exo])

But wait! It turns out that the level "YELLOW" doesn't occur in the training set but does occur in the test set, so the fitted model has no parameter for it and cannot predict for those rows. How do I prevent this kind of error from occurring?

Josef · Accepted Answer

One common approach is to discard any train-test split in which some level of a categorical variable is missing from one of the sets. A more elaborate version of this is to use a splitter that guarantees that every level appears in each set.
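A minimal sketch of the "throw away bad splits" idea, using only numpy and pandas. The helper name split_with_all_levels and its parameters are made up for illustration. It only checks that the training set contains every level seen in the sampled data, which is the condition prediction on the test set needs.

```python
import numpy as np
import pandas as pd


def split_with_all_levels(df, cat_cols, test_frac=0.3, max_tries=1000, seed=0):
    """Redraw random splits until the training set contains every level.

    Hypothetical helper: `cat_cols` lists the categorical columns to check.
    """
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * len(df)))
    # Levels that actually occur in the sampled data, per categorical column.
    required = {c: set(df[c].unique()) for c in cat_cols}
    for _ in range(max_tries):
        perm = rng.permutation(len(df))
        test = df.iloc[perm[:n_test]]
        train = df.iloc[perm[n_test:]]
        # Accept the split only if no level is missing from the training rows.
        if all(set(train[c].unique()) == required[c] for c in cat_cols):
            return train, test
    raise ValueError("no split with all levels in the training set was found")
```

With the data from the question this could be called as train, test = split_with_all_levels(sample, ["X", "Y", "Z"]). If some level is very rare, the loop may fail to find an acceptable split, which is why it gives up after max_tries attempts.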

The main problem is that the parameters for levels (dummy variables) that have no observations in the training dataset are not identified, and so we cannot estimate them. If we leave them out, they are in effect assumed to be zero, and they will also not be included in the prediction with the test dataset.

One way to get a split with the full list of columns, even if a subset has no observations for some level, is to use patsy.dmatrices directly to create the design matrix for the full dataset, and then split the design matrix rather than the original data. This provides a consistent parameterization and consistent columns for any subset or partition of the dataset.

patsy also allows specifying the levels of a categorical variable when creating the design matrix, but I have never tried to include "missing" levels.
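For what it is worth, a small sketch of what specifying the levels might look like, using the levels argument of patsy's C(). This goes beyond what the answer reports having tried. The variable x_levels and the tiny train and test frames are made up for illustration.

```python
import numpy as np
import pandas as pd
import patsy

# Full list of levels, declared up front (taken from the question).
x_levels = ["RED", "ORANGE", "YELLOW", "GREEN", "BLUE", "INDIGO", "VIOLET"]

# Toy data: YELLOW is absent from train but present in test.
train = pd.DataFrame({"X": ["RED", "ORANGE", "GREEN", "BLUE", "INDIGO", "VIOLET"]})
test = pd.DataFrame({"X": ["YELLOW", "RED"]})

# patsy looks up x_levels in the calling environment.
train_dm = patsy.dmatrix("C(X, levels=x_levels)", train)

# Reuse the training design's encoding for the test rows.
(test_dm,) = patsy.build_design_matrices([train_dm.design_info], test)

names = train_dm.design_info.column_names
yellow = [i for i, name in enumerate(names) if "YELLOW" in name]
train_yellow = np.asarray(train_dm)[:, yellow]  # all zeros: level unobserved
test_yellow = np.asarray(test_dm)[:, yellow]    # nonzero for the YELLOW row
```

The same C(X, levels=x_levels) term can in principle be written inside the smf.glm formula. The caveat from above still applies: a column that is all zeros in training gives an unidentified parameter.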
