rf7

Reputation: 2191

Understanding what is returned by the performance() function of ROCR - Classification in R

I have difficulty understanding what is returned by the performance() function of the ROCR package. Let me be concrete with a reproducible example. I use the mpg dataset. My code is the following:

library(ROCR)
library(ggplot2)
library(data.table)
library(caTools)
data(mpg)
setDT(mpg)
mpg[year == 1999, Year99 := 1]
mpg[year == 2008, Year99 := 0]
table(mpg$Year99)
# 0   1 
# 117 117 
split <- sample.split(mpg$Year99, SplitRatio = 0.75)
mpg_train <- mpg[split, ]
mpg_test <- mpg[!split, ]
model <- glm(Year99 ~ displ, mpg_train, family = "binomial")
summary(model)
predict_mpg_test <- predict(model, type = "response", newdata = mpg_test)
ROCR_mpg_test <- prediction(predict_mpg_test, mpg_test$Year99)
performance(ROCR_mpg_test, "acc")

#An object of class "performance"
#Slot "x.name":
#  [1] "Cutoff"

#Slot "y.name":
#  [1] "Accuracy"

#Slot "alpha.name":
#  [1] "none"

#Slot "x.values":
#  [[1]]
#49        55        56        45        47        53        51        57        46        13        39        37        58 
#Inf 0.5983963 0.5926422 0.5868625 0.5752326 0.5635187 0.5576343 0.5458183 0.5398901 0.5280013 0.5220441 0.5101127 0.4981697 0.4921981 
#17        44        31        32        33        50        34        40        24        21        12 
#0.4802634 0.4683511 0.4564748 0.4446478 0.4328831 0.4270282 0.4095919 0.3923800 0.3866994 0.3698468 0.3265163 


#Slot "y.values":
#  [[1]]
#[1] 0.5000000 0.5172414 0.5344828 0.5344828 0.5517241 0.5344828 0.4827586 0.5000000 0.5862069 0.6206897 0.6034483 0.6206897 0.5862069
#[14] 0.5689655 0.5517241 0.5689655 0.5517241 0.5344828 0.5517241 0.5172414 0.5344828 0.4655172 0.4827586 0.4827586 0.5000000


#Slot "alpha.values":
#  list()

My questions are:

  1. What are the 4 rows of numbers listed under Slot "x.values"?
  2. What are the 2 rows of numbers listed under Slot "y.values"?
  3. Can I pass a sequence of cutoffs to the ROCR function, e.g. cutoff = seq(0.05, 0.95, 0.05), and have it return the value of a chosen metric, e.g. accuracy, at each cutoff level?

Your advice will be appreciated.

Upvotes: 4

Views: 2631

Answers (1)

Marco Sandri

Reputation: 24252

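(1) The x.values slot holds the cutoffs.
Apart from the leading Inf, this vector consists of the unique probabilities predicted by the model, sorted in decreasing order. The rows of integers in the printout are the names attached to those values, so the four rows are really one named vector of 25 cutoffs wrapped over two lines: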

prf <- performance(ROCR_mpg_test, "acc")

cutoffs <- prf@x.values[[1]]
pred.probs <- sort(unique(predict_mpg_test), decreasing = TRUE)
all(cutoffs[-1] == pred.probs)
# [1] TRUE
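To make the structure of the cutoff vector concrete, here is a minimal base R sketch on made-up probabilities (toy_probs is a hypothetical vector, not output from the mpg model). It mirrors what the check above suggests: Inf first, then the unique predictions in decreasing order. It does not call ROCR itself.

```r
# Toy predicted probabilities (hypothetical values, not from the mpg model)
toy_probs <- c(0.2, 0.7, 0.7, 0.4, 0.9)

# Inf first, then the unique predictions in decreasing order
toy_cutoffs <- c(Inf, sort(unique(toy_probs), decreasing = TRUE))
toy_cutoffs

# Number of observations classified as positive (prediction >= cutoff)
n_positive <- sapply(toy_cutoffs, function(k) sum(toy_probs >= k))
n_positive
```

The Inf cutoff classifies no observation as positive, and each following cutoff admits every observation whose prediction is at least that large.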

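(2) The y.values slot holds the accuracy at each cutoff, in the same order as x.values. The two printed rows are a single vector of 25 accuracies wrapped over two lines.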

accuracies1 <- prf@y.values[[1]]

# Example. Calculate accuracy for the 3rd cutoff
( tbl <- table(predict_mpg_test>= cutoffs[3], mpg_test$Year99) )
#         0  1
#  FALSE 28 25
#  TRUE   1  4
sum(diag(tbl))/sum(tbl)
# [1] 0.5517241
accuracies1[3]
# [1] 0.5517241

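# Calculate the accuracies for each cutoff
calc_accur <- function(cutoff, pred_prob, response_var) {
  # Fix both levels so the table stays 2x2 even when all predictions
  # fall on one side of the cutoff (otherwise diag() picks the wrong cell)
  predicted <- factor(pred_prob >= cutoff, levels = c(FALSE, TRUE))
  confusion_matrix <- table(predicted, response_var)
  sum(diag(confusion_matrix))/sum(confusion_matrix)
}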

accuracies2 <- sapply(cutoffs, calc_accur,  
       pred_prob=predict_mpg_test, response_var=mpg_test$Year99)

all(accuracies1==accuracies2)
# [1] TRUE

(3) Yes. With the calc_accur function from (2) and sapply, you can pass any sequence of cutoffs and compute the corresponding accuracies.
For example:

seq_cut <- seq(0.3, 0.6, length.out=10)
sapply(seq_cut, calc_accur,  
       pred_prob=predict_mpg_test, response_var=mpg_test$Year99)

# [1] 0.5000000 0.5000000 0.5000000 0.5172414 0.5517241 0.5862069 0.6551724 0.6379310
# [9] 0.5517241 0.5000000
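As a variation on the same idea, the sketch below evaluates accuracy on the grid from the question, seq(0.05, 0.95, 0.05), and collects the results in a data frame. It is base R only and runs on made-up data: toy_probs, toy_labels and the helper accuracy_at are hypothetical names introduced for illustration, and the helper assumes the response is coded 0/1 with 1 as the positive class.

```r
# Toy data (hypothetical, not from the mpg model)
toy_probs  <- c(0.12, 0.33, 0.41, 0.78, 0.66, 0.91)
toy_labels <- c(0, 0, 1, 1, 0, 1)

# Accuracy = share of observations where the thresholded prediction
# agrees with the true class (assumes 1 is the positive class)
accuracy_at <- function(cutoff, pred_prob, response_var) {
  mean((pred_prob >= cutoff) == (response_var == 1))
}

# Evaluate the metric on a user-chosen grid of cutoffs
cutoff_grid <- seq(0.05, 0.95, by = 0.05)
acc_table <- data.frame(
  cutoff   = cutoff_grid,
  accuracy = vapply(cutoff_grid, accuracy_at, numeric(1),
                    pred_prob = toy_probs, response_var = toy_labels)
)

# First row with the highest accuracy on the grid
acc_table[which.max(acc_table$accuracy), ]
```

Comparing logical vectors directly avoids building a confusion table, so the result is still correct when every prediction falls on the same side of the cutoff. To use it on the example above, pass predict_mpg_test and mpg_test$Year99 in place of the toy vectors.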

Upvotes: 5
