Reputation: 447
In the example below, I set up a data frame with three variables: predict, var1, and var2 (a factor).
When I run a model in caret or glmnet, the factor is converted into a dummy variable, such as var2b.
I'd like to extract the variable names programmatically so that they match the original variable names rather than the dummy variable names. Is there a way to do this?
This is just an example; my real-world problem has many variables with many different levels, so I want to avoid doing this manually, for instance by stripping the "b" with a substring call.
Thanks!
library(caret)
library(glmnet)
df <- data.frame(predict = c('Y','Y','N','Y','N','Y','Y','N','Y','N'),
                 var1 = c(1,2,5,1,6,7,3,4,5,6),
                 var2 = c('a','a','b','b','a','a','a','b','b','a'),
                 stringsAsFactors = TRUE)  # default is FALSE since R 4.0.0; needed for the factors shown below
str(df)
# 'data.frame': 10 obs. of 3 variables:
# $ predict: Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 1 2 1
# $ var1 : num 1 2 5 1 6 7 3 4 5 6
# $ var2 : Factor w/ 2 levels "a","b": 1 1 2 2 1 1 1 2 2 1
test <- train(predict ~ .,
              data = df,
              method = 'glmnet',
              trControl = trainControl(classProbs = TRUE,
                                       summaryFunction = twoClassSummary,
                                       allowParallel = FALSE),
              metric = 'ROC',
              tuneGrid = expand.grid(alpha = 1,
                                     lambda = .005))
predictors(test)
# [1] "var1" "var2b"
varImp(test)
# glmnet variable importance
# Overall
# var2b 100
# var1 0
coef(test)
# NULL
#################
# the same kind of model fitted directly with cv.glmnet
x <- model.matrix(predict ~ ., data = df)
x <- x[, -1]  # remove the intercept column
df$predict <- df$predict == 'Y'  # logical response
glmnet1 <- glmnet::cv.glmnet(x = x,
                             y = df$predict,
                             type.measure = 'auc',
                             nfolds = 3,
                             alpha = 1,
                             parallel = FALSE)
rownames(coef(glmnet1))
# [1] "(Intercept)" "var1" "var2b"
Upvotes: 3
Views: 1762
Reputation: 57686
Per @CSJCampbell's answer: the glmnetUtils package allows you to do this, with both glmnet and cv.glmnet objects.
library(glmnetUtils)
m <- glmnet(mpg ~ ., data=mtcars)
all.vars(m$terms)
m2 <- cv.glmnet(mpg ~ ., data=mtcars)
all.vars(m2$terms)
Note that all.vars also works for most other R model objects:
m3 <- lm(mpg ~ ., data=mtcars)
all.vars(delete.response(m3$terms))
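To connect this back to the question, here is a minimal base-R sketch (no glmnetUtils required) showing that all.vars on a terms object returns the original variable names even though the model matrix contains dummy columns. The data frame mirrors the one in the question; the names tt, dummy_names, and original_names are illustrative only.

```r
# Data mirroring the question's df, with explicit factors
df <- data.frame(
  predict = factor(c("Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "N")),
  var1 = c(1, 2, 5, 1, 6, 7, 3, 4, 5, 6),
  var2 = factor(c("a", "a", "b", "b", "a", "a", "a", "b", "b", "a"))
)

# Expand the dot against the data, then drop the response
tt <- delete.response(terms(predict ~ ., data = df))

# Column names as glmnet would see them (intercept dropped)
dummy_names <- colnames(model.matrix(tt, data = df))[-1]

# Original variable names recovered from the terms object
original_names <- all.vars(tt)

print(dummy_names)
print(original_names)
```

The first vector contains the dummy-coded name var2b, while the second contains var2.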
glmnetUtils is available on CRAN, or you can get the dev version from GitHub. I'm currently finalising a major update which should be published to CRAN soon.
Upvotes: 1
Reputation: 2115
The formula method for the 'train' object returns a 'formula' object with the attributes that you are looking for.
f1 <- formula(test)
f1
# predict ~ var1 + var2
# attr(,"variables")
# list(predict, var1, var2)
# attr(,"factors")
# var1 var2
# predict 0 0
# var1 1 0
# var2 0 1
# attr(,"term.labels")
# [1] "var1" "var2"
# attr(,"order")
# [1] 1 1
# attr(,"intercept")
# [1] 1
# attr(,"response")
# [1] 1
# attr(,"predvars")
# list(predict, var1, var2)
# attr(,"dataClasses")
# predict var1 var2
# "factor" "numeric" "factor"
attr(f1, "term.labels")
# [1] "var1" "var2"
The original variable names do not appear to be stored in the 'cv.glmnet' object, and I am not aware of an elegant way of recovering them. The glmnetUtils package might have some quality-of-life functions for this.
Here is some code you could try; note that this will return false positives since it is searching for column names by pattern from the input data (e.g. "var11" will match "var1").
# a generic method
termLabels <- function(object, ...) {
  UseMethod("termLabels")
}
# add for the train object too to save typing
termLabels.train <- function(object, ...) {
  attr(formula(object), "term.labels")
}
# try to find term labels for cv.glmnet object
# lambda must be provided and snaps to search grid
# allowed column names must be provided from corresponding data object
termLabels.cv.glmnet <- function(object, lambda, names, ...) {
  if (missing(lambda)) { stop("lambda is missing") }
  if (missing(names)) { stop("names is missing") }
  # match lambda against the fitted lambda sequence
  # (the a0 element holds the intercepts, not the lambdas)
  lambdaArray <- object$glmnet.fit$lambda
  if (lambda > max(lambdaArray) || lambda < min(lambdaArray)) {
    stop(paste("lambda must be in range",
               paste(range(lambdaArray), collapse = ":")))
  }
  # find closest lambda
  whichLambda <- which.min(abs(lambdaArray - lambda))
  message(paste("using lambda", lambdaArray[whichLambda]))
  # matrix of parameter estimates
  betaLambda <- object$glmnet.fit$beta[, whichLambda, drop = FALSE]
  # non-zero estimates
  betaLambda <- betaLambda[betaLambda[, 1] != 0, , drop = FALSE]
  vars <- rownames(betaLambda)
  # search with names as pattern
  # note, does not account for nested names, e.g. var1 and var11
  # vapply keeps this working when zero or one coefficient is non-zero
  matchNames <- vapply(names, function(nm) any(grepl(nm, vars)), logical(1))
  names[matchNames]
}
termLabels(glmnet1, lambda = glmnet1$lambda.min, names = colnames(df))
# prints the grid lambda that was used, then returns the original names
# whose dummy columns have non-zero coefficients at that lambda
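The false positives mentioned above can be avoided without pattern matching, as long as you still have the model matrix before any column is dropped. This is a sketch in base R under that assumption: model.matrix stores an "assign" attribute that maps each column to the index of the term that produced it. The data frame (including the extra var11 column and the three-level var2) and the names lookup, selected, and original are hypothetical, chosen to show the nested-name case.

```r
# Hypothetical data with nested names (var1, var11) and a three-level factor
df <- data.frame(
  predict = factor(c("Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "N")),
  var1 = c(1, 2, 5, 1, 6, 7, 3, 4, 5, 6),
  var11 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3),
  var2 = factor(c("a", "a", "b", "b", "c", "a", "c", "b", "b", "a"))
)

# Build the full model matrix; subsetting it (e.g. x[, -1]) drops the
# "assign" attribute, so read the attribute first
x <- model.matrix(predict ~ ., data = df)
term_labels <- attr(terms(predict ~ ., data = df), "term.labels")
assign_idx <- attr(x, "assign")  # 0 = intercept, k = k-th term

# Named lookup: dummy column name -> original term label
keep <- assign_idx > 0
lookup <- setNames(term_labels[assign_idx[keep]], colnames(x)[keep])

# Example: columns that a fitted model reported as non-zero (made up here)
selected <- c("var11", "var2c", "var2b")
original <- unique(unname(lookup[selected]))

print(lookup)
print(original)
```

Because the lookup is keyed on exact column names, var11 maps only to var11 and both var2b and var2c map to var2. In practice, selected would come from the row names of the non-zero coefficients, such as vars in the function above.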
Upvotes: 1