Reputation: 47
Hi i am pretty new to programming in R and i am having troble plotting a ROC curve without using any package.
I generated my data using:
d=rpearsonIII(100,0.5,360,20)
nd=rnorm(100,450,25)
i need a vector with values <400 for d and >400 for nd, so i did:
spec = (cumsum(nd[nd>400])/sum(nd))*100
sens = (cumsum(d[d<400])/sum(nd))*100
and the i plotted like this:
plot(1-spec,sens)
but the plot was nothing like i expected it to be
Edit: Thanks to the advice given my code looks like this now:
sc2 = c(rnorm(50,450,25),rpearsonIII(50,0.5,360,20))
scF = sc2 < 395
thresholds <- sort(sc2)
pos <- sum(scF);pos
neg <- sum(!scF);neg
tn <- cumsum(!scF);tn
spec <- tn/neg;spec
tp <- pos - cumsum(scF);tp
sens <- tp/pos;sens
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
The plotted roc curve looks like this: roc curve
My problem now is that if change the order of the generated data (rnorm and rpearsonIII), the curve is reversed.
Upvotes: 3
Views: 3869
Reputation: 2094
I don't know what rpearsonIII
is, so I am just going to make a sample random data with actual classes actuals
as well as the scores for the predictions scores
.
set.seed(100)
actuals <- sample(c(TRUE,FALSE), 100, replace = TRUE)
scores <- runif(100,-1,1)
If in your data the actuals are strings or factors rather than logicals, you will need to convert them to logicals using:
actuals <- actuals == "postiveClass"
Next we want to order the instances based on their scores. We can do this using:
actuals <- actuals[order(scores)]
If you want to keep track of the thresholds for the sensitivities and specificity, you can keep them aligned using:
thresholds <- sort(scores)
Now we need to get our sensitivities and specificities. Sensitivity is TP/P
and specificity is TN/N
. Getting the total number of positives P
is easy, since our actuals are logical, we can just use sum(actuals)
. Similarity, we can get our negatives N
using sum(!actuals)
.
pos <- sum(actuals)
neg <- sum(!actuals)
First lets get our true negatives at each threshold. That is pretty easy, it is just the number of FALSE
values at or below each the threshold. Since our data are in order by threshold, we can calculate that (and the specificity) using:
tn <- cumsum(!actuals)
spec <- tn/neg
The number of true positives is slightly harder because we are looking for the number of positives greater that the threshold, so cumsum
alone won't work. However, since the number above the threshold is equal to the total minus number below or at the threshold, we can get our true positives using:
tp <- pos - cumsum(actuals)
sens <- tp/pos
Now all we need to do is plot the two.
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
To get the AUC of the curve, we simply need to calculate the height of the curve (the sensitivity) multiplied by the width the (difference in 1 - specificity) at each value of actuals. We already have the sensitivity, we just need the specificity. The diff
function will give us our difference in adjacent values of specificity, however, we need to put a 0
value at the beginning to get the width of the first columns.
width <- diff(c(0, 1 - sens))
auc <- sum(spec*width)
actuals <- actuals[order(scores)]
sens <- (sum(actuals) - cumsum(actuals))/sum(actuals)
spec <- cumsum(!actuals)/sum(!actuals)
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
(auc <- sum(spec*diff(c(0, 1 - sens))))
Upvotes: 6