Reputation: 1105
okay so I have data like this, but with more variables of similar type
Company Job Month Reported Injury.Loc Age
1 Cartpenter 2 0 Leg 23
2 Mechanic 12 1 Arm 33
3 Legal 1 1 Arm 24
4 Carpenter 1 1 Leg 75
5 Legal 4 0 Head 23
3 Dental 6 1 Wrist 40
I can't run the following logistic regression on it because of the categorical nature of the variables
log_m1 <- glm(Reported ~. , data = df, family = "binomial")
Is there any way to break up all categorical variables at once AND preserve/keep all numeric variables?
So basically, code to keep the vars I need for the log reg to work.
ERROR:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Upvotes: 0
Views: 964
Reputation: 173793
You can run logistic regression on a mixture of numeric and categorical independent variables - that's not why you're getting the error message.
Let's first show that we can run a regression like this without a problem:
set.seed(69)
df <- data.frame(sex = factor(sample(c("Male", "Female"), 100, TRUE)),
age = sample(21:90, 100, TRUE),
outcome = sample(0:1, 100, TRUE))
glm(outcome ~ ., data = df, family = "binomial")
#>
#> Call: glm(formula = outcome ~ ., family = "binomial", data = df)
#>
#> Coefficients:
#> (Intercept) sexMale age
#> 0.169183 0.019774 -0.003115
#>
#> Degrees of Freedom: 99 Total (i.e. Null); 97 Residual
#> Null Deviance: 138.6
#> Residual Deviance: 138.5 AIC: 144.5
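To see what glm() does with the factor, it can help to look at the design matrix it builds internally. This is a minimal sketch, assuming the same simulated df as above (rebuilt here so the snippet runs on its own); the name design is only illustrative:

```r
set.seed(69)
df <- data.frame(sex = factor(sample(c("Male", "Female"), 100, TRUE)),
                 age = sample(21:90, 100, TRUE),
                 outcome = sample(0:1, 100, TRUE))

# model.matrix() applies the same dummy coding that glm() uses internally
design <- model.matrix(outcome ~ ., data = df)

# Column names of the design matrix match the coefficient names from glm()
colnames(design)
head(design)
```

The factor sex is expanded into a single 0/1 column, sexMale, with Female as the reference level, so there is no need to create dummy variables by hand before calling glm().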
But we can replicate your error if we make all the values for sex the same:
df2 <- within(df, sex <- rep("Male", 100))
glm(outcome ~ ., data = df2, family = "binomial")
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]):
#> contrasts can be applied only to factors with 2 or more levels
So you presumably have a column in your data that has only a single factor level (or only one unique non-NA value). Remove this, and your regression should run as expected.
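One way to track down such a column is to count the distinct values per column on the rows glm() would actually use (by default it drops rows containing NA). The following is only a sketch on toy data; the names toy, complete_rows, constant_cols and toy_clean are made up for the example:

```r
set.seed(69)
# Toy data with one constant column, mimicking the failing case
toy <- data.frame(sex = rep("Male", 100),
                  age = sample(21:90, 100, TRUE),
                  outcome = sample(0:1, 100, TRUE))

# Keep only complete rows, since those are what the default na.action retains
complete_rows <- toy[complete.cases(toy), , drop = FALSE]

# Number of distinct values in each column
n_distinct_vals <- vapply(complete_rows, function(x) length(unique(x)), integer(1))

# Columns with fewer than two distinct values cannot be given contrasts
constant_cols <- names(n_distinct_vals)[n_distinct_vals < 2]
constant_cols

# Drop them and refit
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
fit <- glm(outcome ~ ., data = toy_clean, family = "binomial")
```

On your own data, replace toy with your data frame; any column names reported in constant_cols are candidates for removal.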
Upvotes: 1
Reputation: 39585
You can use the following approach with the vtreat, magrittr and dplyr packages. Here is the code:
library(vtreat)
library(dplyr)
library(magrittr)
#Data
df <- structure(list(Company = c(1L, 2L, 3L, 4L, 5L, 3L), Job = c("Cartpenter",
"Mechanic", "Legal", "Carpenter", "Legal", "Dental"), Month = c(2L,
12L, 1L, 1L, 4L, 6L), Reported = c(0L, 1L, 1L, 1L, 0L, 1L), Injury.Loc = c("Leg",
"Arm", "Arm", "Leg", "Head", "Wrist"), Age = c(23L, 33L, 24L,
75L, 23L, 40L)), class = "data.frame", row.names = c(NA, -6L))
First, we have to isolate the variables to transform and separate them into different dataframes (df will keep the numeric vars and df2 the categorical vars):
#Isolate data variables of type character
vars <- c("Job","Injury.Loc")
df2 <- df[,vars]
df <- df[,-which(names(df) %in% vars)]
With that done, we use designTreatmentsZ() and use_series() to treat the variables and assign the result to a new dataframe:
#Code for dummy vars
treatplan <- designTreatmentsZ(df2, vars)
#Process
scoreFrame <- treatplan %>%
  use_series(scoreFrame) %>%
  select(varName, origName, code)
Now, we isolate the names of the treated variables in newvars using filter():
#Select
newvars <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)
We extract the new variables into a new dataframe:
#Create new data
dframe.treat <- prepare(treatplan, df2, varRestriction = newvars)
Finally, we bind them to the dataframe that holds the numeric variables:
#Bind with original df
newdf <- cbind(df,dframe.treat)
The data will look like this (only some variables are shown to save space):
Company Month Reported Age Job_lev_x_Carpenter Job_lev_x_Cartpenter Job_lev_x_Dental
1 1 2 0 23 0 1 0
2 2 12 1 33 0 0 0
3 3 1 1 24 0 0 0
4 4 1 1 75 1 0 0
5 5 4 0 23 0 0 0
6 3 6 1 40 0 0 1
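If you would rather avoid the extra packages, base R's model.matrix() gives a similar table. This is a sketch of an alternative, not what vtreat does internally: it uses treatment contrasts, so one reference level per categorical variable is dropped instead of getting its own column. The names df_raw, dummies and newdf_base are made up for this example:

```r
# Same six rows as in the question
df_raw <- data.frame(
  Company = c(1L, 2L, 3L, 4L, 5L, 3L),
  Job = c("Cartpenter", "Mechanic", "Legal", "Carpenter", "Legal", "Dental"),
  Month = c(2L, 12L, 1L, 1L, 4L, 6L),
  Reported = c(0L, 1L, 1L, 1L, 0L, 1L),
  Injury.Loc = c("Leg", "Arm", "Arm", "Leg", "Head", "Wrist"),
  Age = c(23L, 33L, 24L, 75L, 23L, 40L)
)

# Expand the character columns into 0/1 indicators; drop the intercept column
dummies <- model.matrix(~ Job + Injury.Loc, data = df_raw)[, -1, drop = FALSE]

# Bind the indicators to the untouched numeric columns
newdf_base <- cbind(df_raw[, c("Company", "Month", "Reported", "Age")], dummies)
colnames(newdf_base)
```

Dropping a reference level avoids one source of perfectly collinear columns, which matters once the model includes an intercept.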
After that, you can fit the model. Just be careful with singularities; otherwise the model can lead to wrong conclusions.
#Model
log_m1 <- glm(Reported ~. , data = newdf, family = "binomial")
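One way to spot singularities is to look for coefficients that glm() could not estimate, which it reports as NA. The sketch below uses the six sample rows from the question with glm()'s own dummy coding rather than the vtreat columns, purely to keep it self-contained; toy, fit and aliased are illustrative names:

```r
# Six sample rows from the question
toy <- data.frame(
  Company = c(1L, 2L, 3L, 4L, 5L, 3L),
  Job = c("Cartpenter", "Mechanic", "Legal", "Carpenter", "Legal", "Dental"),
  Month = c(2L, 12L, 1L, 1L, 4L, 6L),
  Reported = c(0L, 1L, 1L, 1L, 0L, 1L),
  Injury.Loc = c("Leg", "Arm", "Arm", "Leg", "Head", "Wrist"),
  Age = c(23L, 33L, 24L, 75L, 23L, 40L)
)

# Convergence and separation warnings are expected on data this small,
# so they are silenced here
fit <- suppressWarnings(glm(Reported ~ ., data = toy, family = "binomial"))

# Coefficients reported as NA could not be estimated (singular design)
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased

# Compare the rank of the design matrix with the number of coefficients
fit$rank
length(coef(fit))
```

With only six rows and more columns than observations, several coefficients come back as NA. On real data, a non-empty aliased vector is a sign that some predictors are redundant and should be dropped or combined.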
Upvotes: 0