Reputation: 11016
I have the below conceptual problem which I can't get my head around.
Below is an example for survey data where I have a time column that indicates how long someone needs to respond to a certain question.
Now, I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increase the threshold, what would happen if I decrease it.
So my idea was to just create a ROC curve (or other model metrics) to have a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of just looping through my data at maybe 100 different thresholds, calculate false and true positive rates at each threshold and then do a simple line plot, but I was hoping for a more elegant solution that doesn't require me to loop.
Any ideas?
example data:
time
column indidates the time needed per casetruth
column indicates my current decision I want to compare againstpredicted
column indicates the cleaning decision if I would cut at a time threshold of 2.5s
. This is waht I need to change/loop through.set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5)) %>%
mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
Upvotes: 3
Views: 2052
Reputation: 1332
You can use ROCR
too for this
library(ROCR)
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5)) %>%
mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
pred <- prediction(df$time, df$truth)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)
You can also check the AUC value:
auc <- performance(pred, measure = "auc")
[email protected][[1]]
[1] 0.92
Cross checking the AUC value with pROC
library(pROC)
roc(df$truth, df$time)
Call:
roc.default(response = df$truth, predictor = df$time)
Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92
For both the cases, it is same!
Upvotes: 4
Reputation: 4427
So my idea was to just create a ROC curve
Creating a ROC curve is as easy as
library(pROC)
set.seed(3)
data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5)) |>
roc(truth, time) |>
plot()
The problem is that I don't have a machine-learning-like model that would give me class probabilities.
Sorry, I do not understand what is machine-learning-like about the question.
I had the idea of just looping through my data at maybe 100 different thresholds
There is no point in looping over 100 possible thresholds if you got 10 observations. Sensible cutoffs are the nine situated in between your time
values. You can get those from roc
:
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5))
thresholds <- roc(df, truth, time)$thresholds
print(thresholds)
or
> print(thresholds)
[1] -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
[8] 3.093424 3.969994 4.586341 Inf
What exactly is implied in the term looping and whether you want to exclude just a for
and a while
loop or whatever exactly you consider to be a loop needs some precise definition. Is c(1, 2, 3, 4) * 5
a loop? There will be a loop running under the hood.
Upvotes: 4