Reputation: 123
The question seems simple enough. I can plot 95% credible/confidence ellipses in ggplot
using stat_ellipse
. I want to find out how to test whether a given point is within that ellipse.
For example, here is a dataset of points and their 95% ellipses, plus a set of randomvalues
. I'd like a function that calculates TRUE or FALSE whether the randompalues
falls within the ellipse.
library(tidyverse)
set.seed(420)
randomvalues <- data.frame(x = sample(min(mtcars$hp):max(mtcars$hp), 10),
y = sample(min(mtcars$disp):max(mtcars$disp), 10))
ggplot(data = mtcars)+
geom_point(aes(x = hp, y = disp, color = as.factor(cyl)))+
stat_ellipse(aes(x = hp, y = disp, color = as.factor(cyl)))+
geom_point(data = randomvalues, aes(x = x, y = y))
This (very contrived) solution comes close, but when points are close to the border of an ellipse, it fails. The test point here is 310,300, which is just outside of ellipse 8. This function says it falls within ellipse 8.
point_in_ellipse <- function(x, y, points, level = 0.95) {
# Calculate mean of points
mean_x <- mean(points[, 1])
mean_y <- mean(points[, 2])
# Calculate covariance matrix of points
cov_mat <- cov(points)
# Calculate eigenvalues and eigenvectors of covariance matrix
eigen <- eigen(cov_mat)
# Calculate semi-major and semi-minor axes of ellipse
lambda_1 <- sqrt(eigen$values[1] * qchisq(level, df = 2))
lambda_2 <- sqrt(eigen$values[2] * qchisq(level, df = 2))
# Calculate orientation of ellipse
theta <- atan2(eigen$vectors[2, 1], eigen$vectors[1, 1])
# Convert point to ellipse coordinates
dx <- x - mean_x
dy <- y - mean_y
xp <- dx * cos(theta) + dy * sin(theta)
yp <- -dx * sin(theta) + dy * cos(theta)
# Calculate distance from point to ellipse center
d <- sqrt((xp / lambda_1)^2 + (yp / lambda_2)^2)
# Check if point is within ellipse
if (d <= 1) {
return(TRUE)
} else {
return(FALSE)
}
}
# Split mtcars by cylinder
ellipse.points <- mtcars %>%
select(hp, disp, cyl) %>%
split(f = .$cyl)
# Add a randomvalue that is on the border of an ellipse
randomvalues <- rbind(randomvalues, c(310, 300))
# Make a list to receive the answers
answer <- list()
# For each cylinder, calculate the ellipse, and calcualte whether the randomvalues fall inside it
for(i in names(ellipse.points)){
answer[[i]] <- apply(randomvalues, 1, function(x) point_in_ellipse(x = x[1], y = x[2], ellipse.points[[i]]) )
}
# Convert back to df
final.df <- cbind(randomvalues, do.call(cbind, answer))
I think stat_ellipse
must not use the chi squared distribution or the covariance matrix, or must use another way of calculating its ellipses.
Any ideas?
EDIT: 8 May 2022
Based on suggestions from @tjebo, I tried to calculate the ellipses with car::dataEllipse()
, then convert the ellipses to sf
objects, then use st_intersects
to find which points intersect which ellipses.
# Calculate ellipses for each cyl using car::dataEllipse
list.of.cyls <- split(mtcars[,c(4,3)], f = as.factor(mtcars$cyl))
cyl.ellipses <- lapply(list.of.cyls, function(x) car::dataEllipse(x = x$hp, y = x$disp, draw = FALSE) %>%
.$`0.95` %>%
as_tibble(.) %>%
st_as_sf(coords = c("x", "y")) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("POLYGON")
)
# Convert randomvalues to sf
randomvalues.sf <- randomvalues %>%
st_as_sf(coords = c("x", "y"))
# Already the ellipses are off
ggplot()+
geom_sf(data = do.call(bind_rows, cyl.ellipses), fill = NA)+
geom_point(data = randomvalues, aes(x = x, y = y))
ggplot()+
stat_ellipse(data = mtcars, aes(x = hp, y = disp, color = as.factor(cyl)))+
geom_point(data = randomvalues, aes(x = x, y = y))
Before we do any intersection calculations, the ellipses are already off. Not sure how my calculation of ellipses using dataEllipses
varies from stat_ellipse
, but somehow it does.
Anyone have an idea?
Upvotes: 2
Views: 293
Reputation: 226557
short answer: car::dataEllipse
is using an F-statistic cutoff (analogous to a Student t-distribution cutoff, taking the uncertainty of the variance estimate into account). In contrast, you're using a cutoff based on a chi-squared distribution, analogous to Normal cutoffs, assuming the variance is known exactly (i.e., the sampling variance is equal to the true variance).
From that function:
radius <- sqrt(dfn * qf(level, dfn, dfd))
result[[i]] <- ellipse(center, shape, radius, ...)
where shape
is the covariance matrix.
car:::ellipse
uses
angles <- (0:segments) * 2 * pi/segments
unit.circle <- cbind(cos(angles), sin(angles))
Q <- chol(shape, pivot = TRUE)
order <- order(attr(Q, "pivot"))
ellipse <- t(center + radius * t(unit.circle %*% Q[, order]))
Upvotes: 1