Sam
Sam

Reputation: 654

Caret: Unable to adjust performance metric when using rfe function

I'm trying to perform recursive feature elimination using the rfe function but I'm having a bit of trouble trying to change the performance measure to output the ROC:

newFunc <- caretFuncs
newFunc$summary <- twoClassSummary 

ctrl <- rfeControl(functions = newFunc, 
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)

profiler <- rfe(predictors, response, 
                sizes = c(1), 
                method = 'nnet',
                tuneGrid = expand.grid(size=c(4), decay=c(0.1)), 
                maxit = 20,
                metric = 'ROC', 
                rfeControl = ctrl) 

Trying to run this code is giving me the following error:

Error in { : task 1 failed - "undefined columns selected"

If I remove the custom newFunc, set the functions parameter inside rfeControl to use caretFuncs and remove the metric parameter from rfe, the model works fine. This makes me think there's something wrong with the summary.

caretFuncs$summary:

function (data, lev = NULL, model = NULL) 
{
    if (is.character(data$obs)) 
        data$obs <- factor(data$obs, levels = lev)
    postResample(data[, "pred"], data[, "obs"])
}

twoClassSummary

function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    data$y = as.numeric(data$obs == lvls[2])
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}

The output to postResample and twoClassSummary are identical in their structures so I'm a little lost as to what this problem is. Am I doing something inherently wrong here or is this a bug that I need to flag to the devs?


I'm actually interested in obtaining the logLoss so I could write my own function:

logLoss = function(data, lev = NULL, model = NULL) {
  -1*mean(log(data[, 'pred'][model.matrix(~ as.numeric(data[, 'obs'], levels = lev) + 0) - data[, 'pred'] > 0]))
}

But, I'm a little unsure how to convert the factor levels into the correct [0,1] from my [yes, no] factor?

Upvotes: 1

Views: 737

Answers (2)

missuse
missuse

Reputation: 19716

First of all here is a viable logloss function for use with caret:

LogLoss <- function (data, lev = NULL, model = NULL) 
{ 
  obs <- data[, "obs"]
  cls <- levels(obs) #find class names
  probs <- data[, cls[2]] #use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
  logPreds <- log(probs)        
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss")
  out
}

to answer the question how to convert the factor levels into the correct [0,1] from my [yes, no] factor:

real <- (as.numeric(data$obs) - 1)

to get rfe to work you can use rfFuncs instead of caretFuncs. Example:

rfFuncs$summary <- twoClassSummary

ctrl <- rfeControl(functions = rfFuncs, 
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)

profiler <- rfe(Sonar[,1:60], Sonar$Class, 
                sizes = c(1, 5, 20, 40, 60), 
                method = 'nnet',
                tuneGrid = expand.grid(size=c(4), decay=c(0.1)), 
                maxit = 20,
                metric = 'ROC', 
                rfeControl = ctrl)

profiler$results
  Variables       ROC      Sens      Spec      ROCSD      SensSD      SpecSD
1         1 0.6460027 0.6387987 0.5155187 0.08735968 0.132008571 0.007516016
2         5 0.7563971 0.6847403 0.7013180 0.03751483 0.008724045 0.039383924
3        20 0.8633511 0.8462662 0.7017432 0.08460677 0.091143309 0.097708207
4        40 0.8841540 0.8642857 0.7429847 0.08096697 0.090913729 0.098309489
5        60 0.8945351 0.9004870 0.7431973 0.05707867 0.064971175 0.127471631

or with the LogLoss function I provided:

rfFuncs$summary <- LogLoss
ctrl <- rfeControl(functions = rfFuncs, 
                   method = 'cv',
                   returnResamp = TRUE,
                   number = 2,
                   verbose = TRUE)
profiler <- rfe(Sonar[,1:60], Sonar$Class, 
                sizes = c(1, 5, 20, 40, 60), 
                method = 'nnet',
                tuneGrid = expand.grid(size=c(4), decay=c(0.1)), 
                maxit = 20,
                metric = 'LogLoss', 
                rfeControl = ctrl,
                maximize = FALSE) #this was edited after the answer of Дмитрий Пасько) 

profiler$results
  Variables   LogLoss   LogLossSD
1         1 1.8237372 1.030120134
2         5 0.5548774 0.128704686
3        20 0.4226522 0.021547998
4        40 0.4167819 0.013587892
5        60 0.4328718 0.008000892

EDIT: Дмитрий Пасько raises a valid concern in his answer - LogLoss should be minimized. One way to achieve this is to provide the logical argument maximize telling caret should the metric be minimized or maximized.

Upvotes: 2

Demetry Pascal
Demetry Pascal

Reputation: 544

but u should minimize logLoss, thus use this code (example with logistic regression https://www.kaggle.com/demetrypascal/rfe-logreg-with-pca-and-feature-importance):

LogLoss <- function (data, lev = NULL, model = NULL) 
{ 
  obs <- data[, "obs"]
  cls <- levels(obs) #find class names
  probs <- data[, cls[2]] #use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
  logPreds <- log(probs)        
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLossNegative")
  -out
}

lrFuncs$summary <- LogLoss

rfec = rfeControl(method = "cv",
                     number = 2,
                     functions = lrFuncs)

Upvotes: 1

Related Questions