Reputation: 183
I'm struggling to plot a decision boundary in R using ggplot.
I have 2 variables (exam scores) and a binary label indicating whether a student was admitted to school or not. The data look like this:
> head(exam.data)
  Exam1Score Exam2Score Admitted
1   34.62366   78.02469        0
2   30.28671   43.89500        0
3   35.84741   72.90220        0
4   60.18260   86.30855        1
5   79.03274   75.34438        1
6   45.08328   56.31637        0
I can plot the data using ggplot:
exam.plot <- ggplot(data=exam.data, aes(x=Exam1Score, y=Exam2Score, col = ifelse(Admitted == 1,'dark green','red'), size=0.5))+
geom_point()+
labs(x="Exam 1 Scores", y="Exam 2 Scores", title="Exam Scores", colour="Exam Scores")+
theme_bw()+
theme(legend.position="none")
and then successfully fit the logistic regression model:
exam.lm <- glm(data=exam.data, formula=Admitted ~ Exam1Score + Exam2Score, family="binomial")
After much searching the web, I decided to compute the decision boundary manually (I also tried for a while to do this with stat_smooth but couldn't get it to work). I tried the following:
# Fit the decision boundary
plot_x <- c(min(exam.data$Exam1Score)-2, max(exam.data$Exam1Score)+2)
plot_y <- (-1 /coef(exam.lm)[3]) * (coef(exam.lm)[2] * plot_x + coef(exam.lm)[1])
# Build the data frame column-wise: rbind() followed by colnames() would mix x and y values within each column
db.data <- data.frame(x = plot_x, y = plot_y)
# Add the decision boundary plot
ggplot()+geom_line(data=db.data, aes(x=x, y=y))
which successfully plots the decision boundary, but I can't add it to my existing plot with:
> exam.plot+geom_line(data=db.data, aes(x=x, y=y))
Error: Aesthetics must either be length one, or the same length as the dataProblems:x, y
Can someone point out what I'm doing wrong or whether I can actually do this with +stat_smooth()?
All code (ex2.R) and files are here: https://github.com/StuHorsman/rscripts/tree/master/R/Coursera
Thanks!
Stuart
Update: I can achieve something similar with base graphics:
plot(exam.data$Exam1Score, exam.data$Exam2Score, type="n", xlab="Exam 1 Scores", ylab="Exam 2 Scores")
points(exam.data$Exam1Score[exam.data$Admitted==1], exam.data$Exam2Score[exam.data$Admitted==1], pch=4, col="green")
points(exam.data$Exam1Score[exam.data$Admitted==0], exam.data$Exam2Score[exam.data$Admitted==0], pch=4, col="red")
lines(db.data, col="blue")
Upvotes: 2
Views: 4762
Reputation: 1
Why not stat_function?
# Column names are case-sensitive: Exam1Score and Exam2Score, as in exam.data
g=ggplot(exam.data,aes(x=Exam1Score,y=Exam2Score,col=factor(Admitted)))
g=g+geom_point(size=2.2)+scale_color_discrete(name="Admitted")
g=g+stat_function(fun=function(x){(-Intercept-Beta1*x)/Beta2},xlim=c(0,100))
g
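Intercept, Beta1 and Beta2 are the coefficients of the fitted logistic regression, e.g. coef(exam.lm)[1], coef(exam.lm)[2] and coef(exam.lm)[3] for the model in the question.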
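A fuller sketch of the same idea, in case it helps. It assumes synthetic data standing in for exam.data (the real file is not reproduced here) and a helper named boundary that is made up for this example. The boundary is the set of points where the linear predictor is zero, i.e. Intercept + Beta1*x + Beta2*y = 0, which gives y = (-Intercept - Beta1*x)/Beta2.

```r
library(ggplot2)

# Synthetic stand-in for exam.data (assumption: NOT the real data set)
set.seed(1)
n <- 100
exam.data <- data.frame(Exam1Score = runif(n, 30, 100),
                        Exam2Score = runif(n, 30, 100))
# Admission becomes more likely as the sum of both scores grows
exam.data$Admitted <- rbinom(
  n, 1, plogis(0.1 * (exam.data$Exam1Score + exam.data$Exam2Score - 130)))

exam.lm <- glm(Admitted ~ Exam1Score + Exam2Score,
               data = exam.data, family = "binomial")

# unname() drops the coefficient names so they do not leak into later results
Intercept <- unname(coef(exam.lm)[1])
Beta1 <- unname(coef(exam.lm)[2])
Beta2 <- unname(coef(exam.lm)[3])

# Hypothetical helper: Exam2Score on the boundary as a function of Exam1Score
boundary <- function(x) (-Intercept - Beta1 * x) / Beta2

# The colour mapping lives in geom_point() so the function layer does not inherit it
g <- ggplot(exam.data, aes(x = Exam1Score, y = Exam2Score)) +
  geom_point(aes(col = factor(Admitted)), size = 2.2) +
  scale_color_discrete(name = "Admitted") +
  stat_function(fun = boundary, xlim = c(0, 100))
g
```

A quick sanity check is that predict(exam.lm, type = "response") returns 0.5 for any point taken from boundary().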
Upvotes: 0
Reputation: 14093
The problem is that in exam.plot you use not only the aesthetics x and y, but also col and size (the latter unnecessarily). Every layer inherits the aesthetics defined in the ggplot() call, so each layer must either be able to evaluate them on its own data or override them. (I've often been caught by that problem.)
Thus:
exam.plot+geom_line(data=db.data, aes(x=x, y=y), col = "black", size = 1)
does plot.
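Another option, sketched here with made-up data (pts and db are hypothetical names, not from the question), is to tell the layer not to inherit the plot-level aesthetics at all via the inherit.aes argument of the geom. The line layer then only uses the mappings given in its own aes():

```r
library(ggplot2)

# Made-up points and boundary end points, for illustration only
pts <- data.frame(x = c(1, 2, 3, 4),
                  y = c(1, 3, 2, 4),
                  Admitted = c(0, 0, 1, 1))
db <- data.frame(x = c(0, 5), y = c(5, 0))

# Base plot with a colour mapping in ggplot(), similar to the question
base <- ggplot(pts, aes(x = x, y = y, col = factor(Admitted))) +
  geom_point()

# inherit.aes = FALSE: the line layer ignores the inherited colour mapping,
# so db does not need an Admitted column
p <- base + geom_line(data = db, aes(x = x, y = y), inherit.aes = FALSE)
p
```

This keeps the original plot untouched, at the cost of having to repeat x and y in the layer.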
However, I'd recommend changing exam.plot a bit and removing all aesthetics that do not apply to all layers (and putting them into the layer definition instead):
exam.plot <- ggplot(data=exam.data, aes(x = Exam1Score, y=Exam2Score))+
geom_point(aes (col = factor (Admitted)), size = 0.5)+ # factor () so the manual colour scale also works when Admitted is numeric 0/1
scale_color_manual (values = c('red', 'dark green')) +
labs(x="Exam 1 Scores", y="Exam 2 Scores", title="Exam Scores", colour="Exam Scores")+
theme_bw()+
coord_equal () + # assuming that the scores have the same scale.
theme(legend.position="none")
exam.plot + geom_line(data=db.data, aes(x=x, y=y))
Which with example data
exam.data <- data.frame (Exam1Score = rnorm (100) + 0:1,
Exam2Score = rnorm (100) + 0:1,
Admitted = factor (rep (0:1, 50)))
yields:
(plotted with the default point size; 0.5 would hardly be visible for this example)
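For completeness, a sketch of how db.data could be computed for this example data so that the last line of the code above has something to draw. The fixed seed, the 0.5 padding and the short name b are assumptions made for this example; the boundary formula is the one from the question.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
exam.data <- data.frame(Exam1Score = rnorm(100) + 0:1,
                        Exam2Score = rnorm(100) + 0:1,
                        Admitted = factor(rep(0:1, 50)))

exam.lm <- glm(Admitted ~ Exam1Score + Exam2Score,
               data = exam.data, family = "binomial")
b <- unname(coef(exam.lm))  # b[1] intercept, b[2] and b[3] slopes

# Two end points of the line b[1] + b[2] * x + b[3] * y = 0
plot_x <- range(exam.data$Exam1Score) + c(-0.5, 0.5)
db.data <- data.frame(x = plot_x, y = -(b[1] + b[2] * plot_x) / b[3])

exam.plot <- ggplot(exam.data, aes(x = Exam1Score, y = Exam2Score)) +
  geom_point(aes(col = Admitted)) +
  scale_color_manual(values = c("red", "darkgreen")) +
  theme_bw() +
  theme(legend.position = "none")

# Only x and y are defined at plot level, so the line layer needs no extra columns
exam.plot + geom_line(data = db.data, aes(x = x, y = y))
```

Both rows of db.data should have a predicted admission probability of 0.5 under exam.lm, which is an easy way to confirm the line is where the model switches class.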
Upvotes: 2