Reputation: 654
I'm trying to perform recursive feature elimination using the rfe function, but I'm having a bit of trouble changing the performance measure to output the ROC:
newFunc <- caretFuncs
newFunc$summary <- twoClassSummary

ctrl <- rfeControl(functions = newFunc,
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)

profiler <- rfe(predictors, response,
                sizes = c(1),
                method = 'nnet',
                tuneGrid = expand.grid(size = c(4), decay = c(0.1)),
                maxit = 20,
                metric = 'ROC',
                rfeControl = ctrl)
Trying to run this code is giving me the following error:
Error in { : task 1 failed - "undefined columns selected"
If I remove the custom newFunc, set the functions parameter inside rfeControl to use caretFuncs, and remove the metric parameter from rfe, the model works fine. This makes me think there's something wrong with the summary.
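Concretely, this stripped-down variant (same placeholder data as above) runs without error:
ctrl <- rfeControl(functions = caretFuncs,
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)

profiler <- rfe(predictors, response,
                sizes = c(1),
                method = 'nnet',
                tuneGrid = expand.grid(size = c(4), decay = c(0.1)),
                maxit = 20,
                rfeControl = ctrl)   # no metric argument, so the default Accuracy is used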
caretFuncs$summary:
function (data, lev = NULL, model = NULL)
{
    if (is.character(data$obs))
        data$obs <- factor(data$obs, levels = lev)
    postResample(data[, "pred"], data[, "obs"])
}
twoClassSummary:
function (data, lev = NULL, model = NULL)
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2)
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls))
        stop("levels of observed and predicted data do not match")
    data$y = as.numeric(data$obs == lvls[2])
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 1), data[, lvls[1]])
    out <- c(rocAUC,
             sensitivity(data[, "pred"], data[, "obs"], lev[1]),
             specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
The outputs of postResample and twoClassSummary are identical in structure, so I'm a little lost as to what the problem is. Am I doing something inherently wrong here, or is this a bug that I need to flag to the devs?
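For example, on a toy resample data frame (my guess at the columns caret passes to a summary function internally), both return a plain named numeric vector:
library(caret)

d <- data.frame(obs  = factor(c("yes", "no", "yes"), levels = c("no", "yes")),
                pred = factor(c("yes", "no", "no"),  levels = c("no", "yes")),
                no   = c(0.3, 0.8, 0.6),
                yes  = c(0.7, 0.2, 0.4))

postResample(d$pred, d$obs)               # named numeric: Accuracy, Kappa
twoClassSummary(d, lev = levels(d$obs))   # named numeric: ROC, Sens, Spec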
I'm actually interested in obtaining the logLoss, so I could write my own function:
logLoss = function(data, lev = NULL, model = NULL) {
    -1 * mean(log(data[, 'pred'][model.matrix(~ as.numeric(data[, 'obs'], levels = lev) + 0) - data[, 'pred'] > 0]))
}
But I'm a little unsure how to convert the factor levels of my [yes, no] factor into the correct [0, 1] values.
Upvotes: 1
Views: 737
Reputation: 19716
First of all, here is a viable log-loss function for use with caret:
LogLoss <- function (data, lev = NULL, model = NULL)
{
    obs <- data[, "obs"]
    cls <- levels(obs)                                        # find class names
    probs <- data[, cls[2]]                                   # use second class name
    probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)  # bound probability away from 0 and 1
    logPreds <- log(probs)
    log1Preds <- log(1 - probs)
    real <- (as.numeric(data$obs) - 1)                        # map factor levels to 0/1
    out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
    names(out) <- c("LogLoss")
    out
}
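As a quick sanity check, here is a hypothetical stand-in for the data frame caret passes to a summary function:
d <- data.frame(obs = factor(c("no", "yes", "no"), levels = c("no", "yes")),
                no  = c(0.8, 0.3, 0.4),
                yes = c(0.2, 0.7, 0.6))
LogLoss(d)   # returns a single named value: LogLoss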
To answer the question of how to convert the factor levels into the correct [0, 1] from the [yes, no] factor:
real <- (as.numeric(data$obs) - 1)
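For instance, on a throwaway factor (just to illustrate the mapping):
obs <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))
as.numeric(obs) - 1   # 0 1 1 0 -- the first level maps to 0, the second to 1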
To get rfe to work you can use rfFuncs instead of caretFuncs. Example:
library(caret)
library(mlbench)   # provides the Sonar data
data(Sonar)

rfFuncs$summary <- twoClassSummary

ctrl <- rfeControl(functions = rfFuncs,
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)

profiler <- rfe(Sonar[, 1:60], Sonar$Class,
                sizes = c(1, 5, 20, 40, 60),
                method = 'nnet',
                tuneGrid = expand.grid(size = c(4), decay = c(0.1)),
                maxit = 20,
                metric = 'ROC',
                rfeControl = ctrl)
profiler$results
Variables ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.6460027 0.6387987 0.5155187 0.08735968 0.132008571 0.007516016
2 5 0.7563971 0.6847403 0.7013180 0.03751483 0.008724045 0.039383924
3 20 0.8633511 0.8462662 0.7017432 0.08460677 0.091143309 0.097708207
4 40 0.8841540 0.8642857 0.7429847 0.08096697 0.090913729 0.098309489
5 60 0.8945351 0.9004870 0.7431973 0.05707867 0.064971175 0.127471631
Or with the LogLoss function I provided:
rfFuncs$summary <- LogLoss

ctrl <- rfeControl(functions = rfFuncs,
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)

profiler <- rfe(Sonar[, 1:60], Sonar$Class,
                sizes = c(1, 5, 20, 40, 60),
                method = 'nnet',
                tuneGrid = expand.grid(size = c(4), decay = c(0.1)),
                maxit = 20,
                metric = 'LogLoss',
                rfeControl = ctrl,
                maximize = FALSE)   # this was edited after the answer of Дмитрий Пасько
profiler$results
Variables LogLoss LogLossSD
1 1 1.8237372 1.030120134
2 5 0.5548774 0.128704686
3 20 0.4226522 0.021547998
4 40 0.4167819 0.013587892
5 60 0.4328718 0.008000892
EDIT: Дмитрий Пасько raises a valid concern in his answer: LogLoss should be minimized. One way to achieve this is to provide the logical argument maximize, telling caret whether the metric should be minimized or maximized.
Upvotes: 2
Reputation: 544
But you should minimize logLoss, so use this code instead (an example with logistic regression: https://www.kaggle.com/demetrypascal/rfe-logreg-with-pca-and-feature-importance):
LogLoss <- function (data, lev = NULL, model = NULL)
{
    obs <- data[, "obs"]
    cls <- levels(obs)                                        # find class names
    probs <- data[, cls[2]]                                   # use second class name
    probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)  # bound probability
    logPreds <- log(probs)
    log1Preds <- log(1 - probs)
    real <- (as.numeric(data$obs) - 1)
    out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
    names(out) <- c("LogLossNegative")
    -out   # negated, so caret's default maximization minimizes the log loss
}
lrFuncs$summary <- LogLoss

rfec <- rfeControl(method = "cv",
                   number = 2,
                   functions = lrFuncs)
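A sketch of how this control object might then be used (assuming the Sonar data from mlbench, as in the answer above; the metric name must match the name returned by the summary function):
library(caret)
library(mlbench)
data(Sonar)

profiler <- rfe(Sonar[, 1:60], Sonar$Class,
                sizes = c(1, 5, 20, 40, 60),
                rfeControl = rfec,
                metric = "LogLossNegative")   # maximizing the negated value minimizes log loss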
Upvotes: 1