deschen

Reputation: 11016

Create ROC curve manually from data frame

I have the below conceptual problem which I can't get my head around.

Below is an example for survey data where I have a time column that indicates how long someone needs to respond to a certain question.

Responses faster than a threshold (2.5 in the example below) are flagged as "cleaned"; the rest are kept as "final". Now, I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increased or decreased it.

So my idea was to just create a ROC curve (or other model metrics) to have a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of just looping through my data at maybe 100 different thresholds, calculating the false and true positive rates at each threshold, and then drawing a simple line plot, but I was hoping for a more elegant solution that doesn't require me to loop.

Any ideas?

example data:


library(dplyr)  # provides %>%, mutate() and if_else()

set.seed(3)
df <- data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth     = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))

Upvotes: 3

Views: 2052

Answers (2)

Shibaprasad

Reputation: 1332

You can also use ROCR for this:

library(ROCR)
library(dplyr)  # provides %>%, mutate() and if_else()

set.seed(3)
df <- data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth     = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))

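# time is the score; with ROCR's default label ordering,
# the alphabetically later label ("final") is the positive class
pred <- prediction(df$time, df$truth)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)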

[Figure: ROC curve]

You can also check the AUC value:

auc <- performance(pred, measure = "auc")
auc@y.values[[1]]

[1] 0.92

Cross-checking the AUC value with pROC:

library(pROC)

roc(df$truth, df$time)

Call:
roc.default(response = df$truth, predictor = df$time)

Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92

Both packages give the same AUC.

Upvotes: 4

Bernhard

Reputation: 4427

So my idea was to just create a ROC curve

Creating a ROC curve is as easy as

library(pROC)
set.seed(3)
data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
           truth     = rep(c("cleaned", "final"), each = 5)) |>
    roc(truth, time) |>
    plot()


The problem is that I don't have a machine-learning-like model that would give me class probabilities.

Sorry, I do not understand what is machine-learning-like about the question.

I had the idea of just looping through my data at maybe 100 different thresholds

There is no point in looping over 100 possible thresholds if you have only 10 observations. The sensible cutoffs are the nine values situated between your sorted time values. You can get those from roc:

df <- data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                truth     = rep(c("cleaned", "final"), each = 5))

thresholds <- roc(df, truth, time)$thresholds
print(thresholds)

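which prints something like this (no seed is set in this snippet, so the exact values depend on the random draw):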

> print(thresholds)
 [1]     -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
 [8] 3.093424 3.969994 4.586341      Inf
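If you want the rates at each of those cutoffs and not just the cutoffs themselves, pROC's coords() can return them in one call. The following is only a sketch: it re-creates the seeded example data so that it runs on its own, and the names roc_obj and cutoffs are chosen here for illustration.

```r
library(pROC)

set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

# Build the ROC object; quiet = TRUE suppresses the messages about
# which level is used as control/case and the chosen direction
roc_obj <- roc(df$truth, df$time, quiet = TRUE)

# One row per candidate cutoff, with the rates needed for a ROC plot
cutoffs <- coords(roc_obj, x = "all",
                  ret = c("threshold", "sensitivity", "specificity"),
                  transpose = FALSE)

# False positive rate is the complement of specificity
cutoffs$fpr <- 1 - cutoffs$specificity
print(cutoffs)
```

The resulting data frame can be passed to any plotting function, or inspected directly to see how the rates change as the cutoff moves.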

What exactly "looping" means needs a precise definition: do you only want to avoid explicit for and while loops, or something broader? Is c(1, 2, 3, 4) * 5 a loop? There will be a loop running under the hood.
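For illustration, here is one way the manual calculation could look in base R without an explicit for or while loop. It is a sketch under a few assumptions that are not from the question: "cleaned" is treated as the positive class, a response counts as flagged when time < threshold (as in the question's predicted column), and the names flagged, roc_manual and auc_manual are made up here.

```r
set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

# Candidate cutoffs: midpoints between sorted times, plus -Inf and Inf
s <- sort(df$time)
thresholds <- c(-Inf, (head(s, -1) + tail(s, -1)) / 2, Inf)

# Logical matrix: rows = observations, columns = thresholds;
# TRUE where the observation would be flagged (time < threshold)
flagged <- outer(df$time, thresholds, "<")

# Assumption: "cleaned" is the positive class
is_cleaned <- df$truth == "cleaned"

roc_manual <- data.frame(
  threshold = thresholds,
  n_flagged = colSums(flagged),                                 # amount of cleaning
  tpr       = colSums(flagged & is_cleaned)  / sum(is_cleaned), # true positive rate
  fpr       = colSums(flagged & !is_cleaned) / sum(!is_cleaned) # false positive rate
)

# Area under the curve by the trapezoid rule
auc_manual <- with(roc_manual,
                   sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2))

print(roc_manual)
print(auc_manual)
```

Calling plot(roc_manual$fpr, roc_manual$tpr, type = "s") then draws the curve, and the n_flagged column shows directly how many responses would be cleaned at each threshold. For the seeded data, auc_manual should agree with the 0.92 reported by ROCR and pROC. Of course outer() and colSums() loop internally, which is the point made above.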

Upvotes: 4
