Martin Stancsics

Use every level of a categorical variable in a regression

Short description

I am trying to run a (GLM) regression in Matlab (using the fitglm function) where one of the regressors is a categorical variable. However, instead of adding an intercept and dropping the first level, I would like to include every level of the categorical variable and exclude the constant term.

Motivation

I know that, theoretically, the results are the same either way, but I have two reasons against estimating the model with a constant and interpreting the dummy level coefficients differently:

Tried approaches

I tried subclassing the GeneralizedLinearModel class, but unfortunately it is marked as final. Class composition does not work either, as I cannot even inherit from the parent of the GeneralizedLinearModel class. Modifying Matlab's files is not an option, as I use a shared Matlab installation.

The only idea I could come up with is using dummyvar or something similar to turn my categorical variable into a set of dummies, and then using these dummy variables in the regression. AFAIK this is how Matlab works internally, but by taking this approach I lose the user-friendliness of dealing with categorical variables.
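For reference, a minimal sketch of what dummyvar produces for a small categorical vector. The values are made up for illustration, and dummyvar requires the Statistics and Machine Learning Toolbox:

```matlab
% Hypothetical example data: a categorical vector with three levels.
c = categorical({'a'; 'b'; 'a'; 'c'});

% One 0/1 indicator column per level, in the order given by categories(c).
D = dummyvar(c);

% Each row contains exactly one 1, so the columns add up to a column of ones.
% This is why a full set of dummies cannot be combined with an intercept.
rowSums = sum(D, 2);
```

The resulting matrix carries no level names, which is the loss of user-friendliness mentioned above.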

P.S. This question was also posted on MatlabCentral at this link.

Answers (1)

Martin Stancsics

As there seems to be no built-in way to do this, I am posting a short function that I wrote to get the job done.

I have a helper function to convert the categorical variable into an array of dummies:

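function dummyTable = convert_to_dummy_table(catVar)
    % Convert a categorical vector into a table with one 0/1 column per level.
    dummyTable = array2table(dummyvar(catVar));
    % Prefix the columns with the name of the variable passed by the caller.
    % inputname returns '' for expressions such as T.C, so pass a plain variable.
    varName = inputname(1);
    levels = categories(catVar)';
    dummyTable.Properties.VariableNames = strcat(varName, '_', levels);
end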

The usage is quite simple. If you have a table T with some continuous explanatory variables X1, X2, X3, a categorical explanatory variable C and a response variable Y, then instead of using

M = fitglm(T, 'Distribution', 'binomial', 'Link', 'logit', 'ResponseVar', 'Y')

which would fit a logit model using k - 1 levels for the categorical variable and an intercept, one would do

C = T.C;  % plain variable, so that inputname can pick up the name 'C'
estTable = [T(:, {'X1', 'X2', 'X3', 'Y'}), convert_to_dummy_table(C)];
M = fitglm(estTable, 'Distribution', 'binomial', 'Link', 'logit', ...
                     'ResponseVar', 'Y', 'Intercept', false)

It is not as nice and readable as the default way of handling categorical variables, but it has the advantage that the names of the dummy variables are identical to the names that Matlab automatically assigns during estimation using a categorical variable. Therefore the Coefficients table of the resulting M object is easy to parse or understand.
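As a rough illustration of that last point, the sketch below fits the no-intercept model on synthetic data and pulls out the rows of the Coefficients table that belong to the dummies. The data, the level names and the 'C_' prefix are assumptions made up for the example. It builds the dummy table inline rather than calling the helper, so that it runs as a standalone script.

```matlab
rng(0);                                   % reproducible synthetic data
n  = 200;
X1 = randn(n, 1);
C  = categorical(randi(3, n, 1), 1:3, {'low', 'mid', 'high'});
Y  = double(rand(n, 1) < 0.5);            % arbitrary binary response

% Full set of dummies, named <variable>_<level>.
D = array2table(dummyvar(C));
D.Properties.VariableNames = strcat('C_', categories(C)');

estTable = [table(X1, Y), D];
M = fitglm(estTable, 'Distribution', 'binomial', 'Link', 'logit', ...
           'ResponseVar', 'Y', 'Intercept', false);

% Select the coefficient rows whose names start with the dummy prefix.
isDummy    = startsWith(M.CoefficientNames, 'C_');
dummyCoefs = M.Coefficients(isDummy, :)
```

Because every level has its own row, each estimate can be read directly as that level's effect, with no reference level to add back.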
