Reputation: 819
I have some data with predictors and a binary target, e.g.:
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))
I trained a logistic regression model using glm():
model1 <- glm(formula= target ~ a + b, data=df, family=binomial)
Now I'm trying to predict the output (for this example, the same data should suffice):
predict(model1, newdata=df, type="response")
This generates a vector of probabilities. But I want to predict the actual class. I could use round() on the probabilities, but that assumes anything below 0.5 is class '0' and anything above is class '1'. Is this a correct assumption, even when the classes may not be balanced (or close to balanced)? Or is there a way to estimate this threshold?
Upvotes: 11
Views: 40040
Reputation: 10891
predict.glm only supports type = c("link", "response", "terms"), and "response" gives predicted class probabilities, which is almost always the one you want. It is standard for logistic regression to use 0.5 as the threshold, since predicting the class with the highest conditional probability, argmax_y P(Y = y | X = x), is the Bayes (optimal) classifier. parsnip's predict() for logistic_reg() does this by default. Note that the conditional probability is not the same as the proportion of each class in your input data, which is the marginal P(Y = y) and does not depend on X.
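For instance, a minimal sketch with parsnip (assuming the tidymodels packages are available; model2 is just an illustrative name, and the outcome must be a factor for classification):

library(parsnip)

model2 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(factor(target) ~ a + b, data = df)

predict(model2, new_data = df)                  # hard class labels, thresholded at 0.5
predict(model2, new_data = df, type = "prob")   # class probabilities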
However, you don't have to use 0.5 as the threshold: you can tune it against any performance metric computed from the class predictions, such as maximising the F1 score or balancing precision and recall.
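For example, a base-R sketch of tuning the cutoff to maximise F1 on the question's df and model1 (the cutoff grid is an arbitrary choice):

probs <- predict(model1, newdata = df, type = "response")

# F1 = 2*TP / (2*TP + FP + FN) at a given cutoff
f1_at <- function(t) {
  pred <- as.numeric(probs > t)
  tp <- sum(pred == 1 & df$target == 1)
  fp <- sum(pred == 1 & df$target == 0)
  fn <- sum(pred == 0 & df$target == 1)
  if (tp == 0) return(0)
  2 * tp / (2 * tp + fp + fn)
}

cutoffs <- seq(0.05, 0.95, by = 0.05)
cutoffs[which.max(sapply(cutoffs, f1_at))]  # cutoff with the best F1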
Upvotes: 0
Reputation: 1268
To get the threshold where sensitivity and specificity are closest (i.e. the crossover point in the graphs shown in the other answers) programmatically, you can use this code, which gets pretty close:
library(ROCR)  # PREDS: predicted probabilities, LABELS: true 0/1 classes
predictions <- prediction(PREDS, LABELS)
# (cutoff, value) pairs for sensitivity and specificity at every cutoff
sens <- cbind(unlist(performance(predictions, "sens")@x.values), unlist(performance(predictions, "sens")@y.values))
spec <- cbind(unlist(performance(predictions, "spec")@x.values), unlist(performance(predictions, "spec")@y.values))
# cutoff whose sensitivity row is closest (L1 distance) to some specificity row
sens[which.min(apply(sens, 1, function(x) min(colSums(abs(t(spec) - x))))), 1]
Upvotes: 2
Reputation: 33
There are 12 methods implemented in the function PresenceAbsence::optimal.thresholds of the PresenceAbsence package.
This is also covered in Freeman, E. A., & Moisen, G. G. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecological Modelling, 217(1-2), 48-58.
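A minimal sketch applying it to the question's model1 (assuming the DATA layout the package documents: an ID column, then the observed 0/1 values, then one or more columns of predicted probabilities):

library(PresenceAbsence)

# column 1 = observation ID, column 2 = observed 0/1, column 3 = predicted probabilities
dat <- data.frame(id        = seq_len(nrow(df)),
                  observed  = df$target,
                  predicted = predict(model1, newdata = df, type = "response"))

optimal.thresholds(dat, opt.methods = 1:12)  # compare all 12 threshold criteria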
Upvotes: 1
Reputation: 1009
I tooled around trying to replicate the first graph. Given a predictions <- prediction(pred, labels) object from ROCR, then:
Base R approach
plot(unlist(performance(predictions, "sens")@x.values), unlist(performance(predictions, "sens")@y.values),
     type="l", lwd=2, ylab="Sensitivity", xlab="Cutoff")
par(new=TRUE)  # overlay the specificity curve on the same plot region
plot(unlist(performance(predictions, "spec")@x.values), unlist(performance(predictions, "spec")@y.values),
     type="l", lwd=2, col='red', ylab="", xlab="", axes=FALSE)
axis(4, at=seq(0, 1, 0.2))  # right-hand axis for specificity
mtext("Specificity", side=4, padj=-2, col='red')
ggplot2 approach
library(ggplot2)
library(magrittr)  # for %>%

sens <- data.frame(x=unlist(performance(predictions, "sens")@x.values),
                   y=unlist(performance(predictions, "sens")@y.values))
spec <- data.frame(x=unlist(performance(predictions, "spec")@x.values),
                   y=unlist(performance(predictions, "spec")@y.values))
sens %>% ggplot(aes(x, y)) +
  geom_line() +
  geom_line(data=spec, aes(x, y), colour="red") +  # colour set directly, not mapped in aes()
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x="Cutoff", y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "red"))
Upvotes: 4
Reputation: 1
You can try the following (pred is a ROCR prediction object, as in the other answers):
library(ROCR)

# specificity vs. cutoff
perfspec <- performance(prediction.obj = pred, measure="spec", x.measure="cutoff")
plot(perfspec)

# overlay sensitivity vs. cutoff on the same axes
par(new=TRUE)
perfsens <- performance(prediction.obj = pred, measure="sens", x.measure="cutoff")
plot(perfsens)
Upvotes: -3
Reputation: 7121
The best threshold (or cutoff) point to use in glm models is the point which jointly maximises sensitivity and specificity. This threshold might not give the highest accuracy in your model, but it won't be biased towards positives or negatives. The ROCR package contains functions that can help you do this; check the performance() function in this package. It will get you what you're looking for. Here's a picture of what you should expect to get:

[Figure: sensitivity and specificity plotted against the cutoff, with the two curves crossing at the optimal threshold]
After finding the cutoff point, I normally write a function myself to find the number of data points whose predicted value is above the cutoff, and match them with the group they belong to.
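For instance, a small sketch of such a helper applied to the question's model1 (the function name classify_at_cutoff is illustrative, not from any package):

# Hypothetical helper: classify at a chosen cutoff and cross-tabulate
# the predicted classes against the true labels
classify_at_cutoff <- function(probs, labels, cutoff) {
  pred_class <- ifelse(probs > cutoff, 1, 0)
  table(predicted = pred_class, actual = labels)
}

probs <- predict(model1, newdata = df, type = "response")
classify_at_cutoff(probs, df$target, cutoff = 0.5)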
Upvotes: 9
Reputation: 75545
The gold standard for determining good model parameters, including "what threshold should I set" for logistic regression, is cross-validation.
The general idea is to hold out one or more parts of your training set and choose the threshold that maximises the number of correct classifications on this held-out set; Wikipedia can give you many more details.
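A minimal sketch with a single held-out split (a full k-fold cross-validation would repeat this over folds and average; the 70/30 split and the cutoff grid here are arbitrary choices):

set.seed(1)
idx     <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train   <- df[idx, ]
heldout <- df[-idx, ]

fit   <- glm(target ~ a + b, data = train, family = binomial)
probs <- predict(fit, newdata = heldout, type = "response")

# pick the cutoff with the most correct classifications on the held-out set
cutoffs  <- seq(0.05, 0.95, by = 0.05)
accuracy <- sapply(cutoffs, function(t) mean((probs > t) == heldout$target))
cutoffs[which.max(accuracy)]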
Upvotes: 5