Reputation: 49
I have a dataset that looks like this, where "1" means a host is infected and "0" means a host is uninfected at the specified dose. However, the ROC function needs the observed data plus false positive and true positive rates to generate the ROC curve. I think I am missing a step or miscalculated something, but I'm not sure what it is.
library(pROC)
dataname <- data.frame(Dose = c(rep(0.2, 8), rep(0.3, 7), rep(0.7, 10)),
                       Infected = c(rep(0, 20), rep(1, 5)))
I used a GLM to get the probability of each host being infected at each dose.
#logistic model
logistic <- glm(
formula = Infected ~ Dose,
data = dataname,
family = binomial(link = 'logit')
)
I then ordered the probabilities from lowest to highest and ranked them:
predicted.data <- data.frame(prob.inf = logistic$fitted.values, Infected = dataname$Infected)
predicted.data <- predicted.data[order(predicted.data$prob.inf, decreasing = FALSE), ]
predicted.data$rank <- 1:nrow(predicted.data)
I then ran the roc function and plotted the curve:
roc_data <- roc(dataname$Infected, predicted.data$prob.inf)
plot(roc_data, main = "ROC Curve", print.auc = TRUE, xlim = c(0, 1), ylim = c(0, 1))
Upvotes: 1
Views: 462
Reputation: 898
You do not need to order and rank the predicted probabilities. In fact, the ordering is what breaks your call: roc() pairs each element of the response with the predictor at the same position, so once predicted.data has been sorted, predicted.data$prob.inf no longer lines up with dataname$Infected. Assuming you are using the roc() function from the pROC package, you can simply feed it your response dataname$Infected and your fitted values logistic$fitted.values.
The following code:
library(pROC)
dataname <- data.frame(Dose = c(rep(0.2, 8), rep(0.3, 7), rep(0.7, 10)),
                       Infected = c(rep(0, 20), rep(1, 5)))
logistic <- glm(
formula = Infected ~ Dose,
data = dataname,
family = binomial(link = 'logit')
)
predicted.data <- data.frame(prob.inf = logistic$fitted.values, Infected = dataname$Infected)
roc_data <- roc(dataname$Infected, predicted.data$prob.inf)
plot(roc_data, main = "ROC Curve", print.auc = TRUE,
     xlab = "Specificity (true negative rate)", ylab = "Sensitivity (true positive rate)")
produces an ROC curve with the AUC printed on the plot, which seems correct to me.
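If you also want the AUC as a standalone number rather than just printed on the plot, pROC's auc() extracts it from the roc object (a minimal sketch reusing roc_data from above):
# AUC as an 'auc' object with a print method
auc(roc_data)
# coerce to a plain numeric value
as.numeric(auc(roc_data))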
Upvotes: 1
Reputation: 20399
For a true understanding of model diagnostics it is IMHO quite enlightening to calculate some metrics by hand (which is not overly complicated). In a logistic regression setting you would start with the confusion matrix and derive the relevant metrics from there.
Here's a worked-out example:
#### Use the Challenger data as sample data for a GLM
data(Challeng, package = "alr4")
c_mod <- glm(fail > 0 ~ temp, data = Challeng, family = "binomial")
### Do calculations by hand
## 1. Create observed vs prediction data.frame
obs_pred <- data.frame(fail = as.integer(Challeng$fail > 0),
pred = predict(c_mod, type = "response"))
## 2. Get all potential cutoff values
cs <- c(0, sort(unique(obs_pred$pred)))
## 3. Calculate all potential confusion matrices (i.e. 2x2 observed vs. predicted tables);
##    levels 1:0 put the positive class in the first row/column
cms <- lapply(cs, \(co) table(data.frame(obs = factor(as.integer(Challeng$fail > 0), 1:0),
                                         pred = factor(as.integer(obs_pred$pred > co), 1:0))))
## 4. Get True Positive Rate (tpr) and False Positive Rate (fpr)
tpr <- vapply(cms, \(tab) tab[1L, 1L] / sum(tab[1L, ]), numeric(1L)) # TP / (TP + FN)
fpr <- vapply(cms, \(tab) tab[2L, 1L] / sum(tab[2L, ]), numeric(1L)) # FP / (FP + TN)
## 5. Plot fpr vs tpr
plot(fpr, tpr, type = "l")
Now that this is understood, we can use ready-made libraries to do the same. There are quite a few of them (a nice comparison can be found here); one option is ROCR:
library(ROCR)
## 1. Create a 'prediction' object (c.f. ?ROCR::prediction)
pp_c <- with(obs_pred, prediction(pred, fail))
## 2. Get True Positive Rate (tpr) and False Positive Rate (fpr)
perf_c <- performance(pp_c, "tpr", "fpr")
## 3. Plot
plot(perf_c)
## 4. Same as the by-hand calculation (fpr is stored in x.values, tpr in y.values)
all.equal(rev(fpr), perf_c@x.values[[1L]])
# [1] TRUE
all.equal(rev(tpr), perf_c@y.values[[1L]])
# [1] TRUE
From the confusion matrix (and a bit of basic geometry) you can also calculate the area under the curve.
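The sum below is just the trapezoidal rule applied to consecutive ROC points (fpr_i, tpr_i):

AUC ≈ sum_i (fpr_{i+1} - fpr_i) * (tpr_i + tpr_{i+1}) / 2

where diff() supplies the strip widths and the second factor is the average strip height: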
### AUC Calculations
sw <- cbind(fpr = rev(fpr), tpr = rev(tpr))
sum(diff(sw[, "fpr"]) * (sw[-nrow(sw), "tpr"] + diff(sw[, "tpr"]) / 2))
# [1] 0.78125
performance(pp_c, "auc")@y.values[[1L]]
# [1] 0.78125
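As a cross-check (not part of the calculation above, and assuming pROC is installed), feeding the same observed/predicted pairs to pROC should report the same area:
# Cross-check with pROC; expected to agree with the 0.78125 above
library(pROC)
auc(roc(obs_pred$fail, obs_pred$pred))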
Upvotes: 1