John Thomas

Reputation: 1105

logistic regression choosing the right variables in R

Okay, so I have data like this, but with more variables of a similar type:

Company         Job  Month  Reported  Injury.Loc  Age
      1  Cartpenter      2         0         Leg   23
      2    Mechanic     12         1         Arm   33
      3       Legal      1         1         Arm   24
      4   Carpenter      1         1         Leg   75
      5       Legal      4         0        Head   23
      3      Dental      6         1       Wrist   40

I can't run the following logistic regression on it because of the categorical nature of the variables:

log_m1 <- glm(Reported ~. , data = df, family = "binomial")

Is there any way to break up all categorical variables at once AND preserve/keep all numeric variables?

So basically, code to keep the vars I need for the log reg to work.

ERROR:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Upvotes: 0

Views: 964

Answers (2)

Allan Cameron

Reputation: 173793

You can run logistic regression on a mixture of numeric and categorical independent variables - that's not why you're getting the error message.

Let's first show that we can run a regression like this without a problem:

set.seed(69)

df <- data.frame(sex = factor(sample(c("Male", "Female"), 100, TRUE)),
                 age = sample(21:90, 100, TRUE),
                 outcome = sample(0:1, 100, TRUE))

glm(outcome ~ ., data = df, family = "binomial")
#> 
#> Call:  glm(formula = outcome ~ ., family = "binomial", data = df)
#> 
#> Coefficients:
#> (Intercept)      sexMale          age  
#>    0.169183     0.019774    -0.003115  
#> 
#> Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
#> Null Deviance:       138.6 
#> Residual Deviance: 138.5     AIC: 144.5

But we can replicate your error if we make all the values for sex the same:

df2 <- within(df, sex <- rep("Male", 100))

glm(outcome ~ ., data = df2, family = "binomial")
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): 
#> contrasts can be applied only to factors with 2 or more levels

So you presumably have a column in your data that has only a single factor level (or only one unique non-NA value). Remove this, and your regression should run as expected.
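
A minimal sketch of how one might locate such a column. The names complete, n_unique, constant_cols and df_clean are made up for illustration, and df2 is rebuilt here so the snippet runs on its own. Because glm() drops rows containing NA by default, the distinct values are counted on complete rows only:

```r
set.seed(69)

# Rebuild an example with a constant categorical column (illustrative data)
df2 <- data.frame(sex = rep("Male", 100),
                  age = sample(21:90, 100, TRUE),
                  outcome = sample(0:1, 100, TRUE),
                  stringsAsFactors = FALSE)

# Keep only the rows glm() would actually use (default na.action is na.omit)
complete <- na.omit(df2)

# Count the distinct values in each column of the complete rows
n_unique <- vapply(complete, function(x) length(unique(x)), integer(1))

# Columns with fewer than two distinct values cannot be given contrasts
constant_cols <- names(n_unique)[n_unique < 2]
constant_cols

# Drop those columns and refit
df_clean <- df2[, setdiff(names(df2), constant_cols), drop = FALSE]
fit <- glm(outcome ~ ., data = df_clean, family = "binomial")
```

Here constant_cols contains "sex", and the model fits once that column is dropped.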

Upvotes: 1

Duck

Reputation: 39585

You can use the following approach with the vtreat, magrittr and dplyr packages. Here is the code:

library(vtreat)
library(dplyr)
library(magrittr)
#Data
df <- structure(list(Company = c(1L, 2L, 3L, 4L, 5L, 3L), Job = c("Cartpenter", 
"Mechanic", "Legal", "Carpenter", "Legal", "Dental"), Month = c(2L, 
12L, 1L, 1L, 4L, 6L), Reported = c(0L, 1L, 1L, 1L, 0L, 1L), Injury.Loc = c("Leg", 
"Arm", "Arm", "Leg", "Head", "Wrist"), Age = c(23L, 33L, 24L, 
75L, 23L, 40L)), class = "data.frame", row.names = c(NA, -6L))

First, we isolate the variables to transform and split the data into two data frames (df keeps the numeric variables and df2 the categorical ones):

#Isolate data variables of type character
vars <- c("Job","Injury.Loc")
df2 <- df[,vars]
df <- df[, !(names(df) %in% vars)]

With that done, we use designTreatmentsZ() to build a treatment plan and use_series() (magrittr's pipe-friendly form of `$`) to pull its scoreFrame into a new data frame:

#Code for dummy vars
treatplan <- designTreatmentsZ(df2, vars)
#Process
scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code)

Next, we use filter() to keep only the names of the treated variables in newvars:

#Select
newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName)

We then build a new data frame containing those variables:

#Create new data
dframe.treat <- prepare(treatplan, df2, varRestriction = newvars)

Finally, we bind it to the data frame that holds the numeric variables:

#Bind with original df
newdf <- cbind(df,dframe.treat)

The data will look like this (only some columns are shown to save space):

  Company Month Reported Age Job_lev_x_Carpenter Job_lev_x_Cartpenter Job_lev_x_Dental
1       1     2        0  23                   0                    1                0
2       2    12        1  33                   0                    0                0
3       3     1        1  24                   0                    0                0
4       4     1        1  75                   1                    0                0
5       5     4        0  23                   0                    0                0
6       3     6        1  40                   0                    0                1
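
For comparison, a similar indicator-column layout can be sketched in base R with model.matrix(). This is a different technique from the vtreat plan above, not a replacement for it: with R's default treatment contrasts, the first level of each categorical variable (here Carpenter and Arm) is dropped as the reference level, so there is one column fewer per variable than in the table above. The names mm and newdf_base are illustrative.

```r
# Same six rows as in the question
df <- data.frame(Company = c(1L, 2L, 3L, 4L, 5L, 3L),
                 Job = c("Cartpenter", "Mechanic", "Legal",
                         "Carpenter", "Legal", "Dental"),
                 Month = c(2L, 12L, 1L, 1L, 4L, 6L),
                 Reported = c(0L, 1L, 1L, 1L, 0L, 1L),
                 Injury.Loc = c("Leg", "Arm", "Arm", "Leg", "Head", "Wrist"),
                 Age = c(23L, 33L, 24L, 75L, 23L, 40L),
                 stringsAsFactors = FALSE)

# model.matrix() expands character/factor predictors into 0/1 columns
# and passes numeric predictors through unchanged
mm <- model.matrix(Reported ~ ., data = df)

# Drop the intercept column and put the response back in front
newdf_base <- data.frame(Reported = df$Reported, mm[, -1, drop = FALSE])
names(newdf_base)
```

This is also what glm() does internally when it is given categorical predictors, which is why no manual encoding is strictly required.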

After that you can fit the model. Just be careful with singularities, otherwise the model can lead to wrong conclusions.

#Model
log_m1 <- glm(Reported ~. , data = newdf, family = "binomial")
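
One way to check for the singularities mentioned above is to look for NA coefficients, which glm() reports for columns it could not estimate. The sketch below fits the six example rows directly (the object names fit and aliased are illustrative). With more indicator columns than rows, some coefficients are necessarily NA, and a sample this small is likely to raise convergence warnings, so they are suppressed here:

```r
# Same six rows as in the question
df <- data.frame(Company = c(1L, 2L, 3L, 4L, 5L, 3L),
                 Job = c("Cartpenter", "Mechanic", "Legal",
                         "Carpenter", "Legal", "Dental"),
                 Month = c(2L, 12L, 1L, 1L, 4L, 6L),
                 Reported = c(0L, 1L, 1L, 1L, 0L, 1L),
                 Injury.Loc = c("Leg", "Arm", "Arm", "Leg", "Head", "Wrist"),
                 Age = c(23L, 33L, 24L, 75L, 23L, 40L),
                 stringsAsFactors = FALSE)

# Fit the model; warnings are suppressed only because this toy data set
# has far too few rows for the number of predictors
fit <- suppressWarnings(glm(Reported ~ ., data = df, family = "binomial"))

# Coefficients that could not be estimated are returned as NA
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased
```

If aliased is not empty on real data, the corresponding columns are linear combinations of others (or there are too few rows), and the remaining estimates should be interpreted with care.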

Upvotes: 0
