davetracy

Reputation: 11

Maximum number of covariates in R's glm() is 128?

I have a dataset with 146 covariates, and am training a logistic regression.

logit = glm(Y ~ .,
        data = pred.dataset[1:1000,],
        family = binomial)

The model trains very quickly, but when I then try to view the betas with

logit

everything after the 128th variable comes back as NA.

I first noticed this when exporting the model as PMML: it stops listing betas after 128 predictors.

I've gone through the documentation and can't find any reference to a maximum number of covariates. I've also trained on 60k rows and still see NAs after the 128th predictor.

Is this a limitation of glm or of my system? I am running 64-bit R 3.1.2. How can I increase the number of predictors?
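
For reference, a quick way to see which betas come back as NA and how many model-matrix columns glm actually has to fit (same logit and pred.dataset as above):

# quick checks on the fitted model
X <- model.matrix(Y ~ ., data = pred.dataset[1:1000, ])
ncol(X)                                  # columns glm has to estimate
logit$rank                               # rank the fit actually achieved
names(coef(logit))[is.na(coef(logit))]   # terms whose betas come back NA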

Upvotes: 1

Views: 923

Answers (2)

Sarah Hirsch

Reputation: 66

You didn't provide reproducible data, so it's hard to tell exactly what is going on: is there an issue with how some of the variables are coded? Are variables that seem uniform actually not uniform at all? These are the kinds of situations that could be ruled out with a reproducible code example.

However, I'm answering because I think you may have a legitimate concern. What can you say about these other variables? What type are they? I have been trying to run some logits that seem to drop factor levels beyond 48.

What worked for me (at least to get the model to run in full) was going into the glm() function and changing

mf$drop.unused.levels <- TRUE

to

mf$drop.unused.levels <- FALSE

then saving the function under a different name and using that to run my analyses. (I was inspired by this answer.)
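
In case it helps, here is a rough sketch of what I mean. The name glm_keep_levels is just mine; editing a copy of glm() by hand and saving it under a different name amounts to the same thing.

# copy stats::glm and flip drop.unused.levels from TRUE to FALSE inside the copy;
# the copy keeps glm's environment, so internal calls like glm.fit still resolve
glm_keep_levels <- stats::glm
bdy <- body(glm_keep_levels)
for (i in seq_along(bdy)) {
  if (identical(bdy[[i]], quote(mf$drop.unused.levels <- TRUE)))
    bdy[[i]] <- quote(mf$drop.unused.levels <- FALSE)
}
body(glm_keep_levels) <- bdy

# then run the model with the copy, e.g.
# logit <- glm_keep_levels(Y ~ ., data = pred.dataset[1:1000, ], family = binomial)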

Be warned, though! It gave me some warning messages:

Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type ==  :
  prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type ==  :
  prediction from a rank-deficient fit may be misleading
3: In predict.lm(object, newdata, se.fit, scale = 1, type = if (type ==  :
  prediction from a rank-deficient fit may be misleading

I know that there are frequency issues in certain groups in the data; I'll have to analyze those separately, and I will. But for the time being, I have the predictions for all the levels I wanted.

The first step, though, should be to check your data. With my data, this almost certainly happens partly because of issues in the data itself, but this approach let me override it. It may or may not be an appropriate solution for you.

Upvotes: 1

costebk08

Reputation: 1359

This is a question I actually just asked on Stack Exchange, which is where this question belongs. See this link: https://stats.stackexchange.com/questions/159316/logistic-regression-in-r-with-many-predictors?noredirect=1#comment303422_159316 and the subsequent links in that thread. To answer your question, though: that is basically too many predictors for logistic regression. OLS can be used instead in this case; even though it does not yield the best results for a binary outcome, the results are still valid and usable.
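
If you go that route, it is just lm() on the 0/1 outcome (a linear probability model), something like:

# OLS on the binary outcome (linear probability model);
# assumes Y is coded 0/1 - if it is a factor, convert it first, e.g.
# pred.dataset$Y <- as.numeric(pred.dataset$Y) - 1
ols <- lm(Y ~ ., data = pred.dataset[1:1000, ])
summary(ols)       # betas for all non-aliased predictors
head(fitted(ols))  # rough estimates of P(Y = 1); can fall outside [0, 1]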

Upvotes: 1
