Alex Krohn
Alex Krohn

Reputation: 123

Function to Calculate Whether A Point is Within Ellipse using stat_ellipse

The question seems simple enough. I can plot 95% credible/confidence ellipses in ggplot using stat_ellipse. I want to find out how to test whether a given point is within that ellipse.

For example, here is a dataset of points and their 95% ellipses, plus a set of randomvalues. I'd like a function that calculates TRUE or FALSE whether the randompalues falls within the ellipse.

library(tidyverse)

set.seed(420)

randomvalues <- data.frame(x = sample(min(mtcars$hp):max(mtcars$hp), 10),
                           y = sample(min(mtcars$disp):max(mtcars$disp), 10))

ggplot(data = mtcars)+
  geom_point(aes(x = hp, y = disp, color = as.factor(cyl)))+
  stat_ellipse(aes(x = hp, y = disp, color = as.factor(cyl)))+
  geom_point(data = randomvalues, aes(x = x, y = y))

This (very contrived) solution comes close, but when points are close to the border of an ellipse, it fails. The test point here is 310,300, which is just outside of ellipse 8. This function says it falls within ellipse 8.

point_in_ellipse <- function(x, y, points, level = 0.95) {
  
  # Calculate mean of points
  mean_x <- mean(points[, 1])
  mean_y <- mean(points[, 2])
  
  # Calculate covariance matrix of points
  cov_mat <- cov(points)
  
  # Calculate eigenvalues and eigenvectors of covariance matrix
  eigen <- eigen(cov_mat)
  
  # Calculate semi-major and semi-minor axes of ellipse
  lambda_1 <- sqrt(eigen$values[1] * qchisq(level, df = 2))
  lambda_2 <- sqrt(eigen$values[2] * qchisq(level, df = 2))
  
  # Calculate orientation of ellipse
  theta <- atan2(eigen$vectors[2, 1], eigen$vectors[1, 1])
  
  # Convert point to ellipse coordinates
  dx <- x - mean_x
  dy <- y - mean_y
  xp <- dx * cos(theta) + dy * sin(theta)
  yp <- -dx * sin(theta) + dy * cos(theta)
  
  # Calculate distance from point to ellipse center
  d <- sqrt((xp / lambda_1)^2 + (yp / lambda_2)^2)
  
  # Check if point is within ellipse
  if (d <= 1) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

# Split mtcars by cylinder
ellipse.points <- mtcars %>%
  select(hp, disp, cyl) %>%
  split(f = .$cyl)

# Add a randomvalue that is on the border of an ellipse
randomvalues <- rbind(randomvalues, c(310, 300))

# Make a list to receive the answers
answer <- list()

# For each cylinder, calculate the ellipse, and calcualte whether the randomvalues fall inside it
for(i in names(ellipse.points)){
  answer[[i]] <- apply(randomvalues, 1, function(x) point_in_ellipse(x = x[1], y = x[2], ellipse.points[[i]]) )
}

# Convert back to df
final.df <- cbind(randomvalues, do.call(cbind, answer))

I think stat_ellipse must not use the chi squared distribution or the covariance matrix, or must use another way of calculating its ellipses.

Any ideas?

EDIT: 8 May 2022 Based on suggestions from @tjebo, I tried to calculate the ellipses with car::dataEllipse(), then convert the ellipses to sf objects, then use st_intersects to find which points intersect which ellipses.

# Calculate ellipses for each cyl using car::dataEllipse
list.of.cyls <- split(mtcars[,c(4,3)], f = as.factor(mtcars$cyl))

cyl.ellipses <- lapply(list.of.cyls, function(x) car::dataEllipse(x = x$hp, y = x$disp, draw = FALSE) %>%
                           .$`0.95` %>%
                           as_tibble(.) %>%
                           st_as_sf(coords = c("x", "y")) %>%
                           summarise(geometry = st_combine(geometry)) %>%
                           st_cast("POLYGON")
)

# Convert randomvalues to sf
randomvalues.sf <- randomvalues %>%
  st_as_sf(coords = c("x", "y"))

# Already the ellipses are off
ggplot()+
  geom_sf(data = do.call(bind_rows, cyl.ellipses), fill = NA)+
  geom_point(data = randomvalues, aes(x = x, y = y))

ggplot()+
  stat_ellipse(data = mtcars, aes(x = hp, y = disp, color = as.factor(cyl)))+
  geom_point(data = randomvalues, aes(x = x, y = y))

Before we do any intersection calculations, the ellipses are already off. Not sure how my calculation of ellipses using dataEllipses varies from stat_ellipse, but somehow it does.

Anyone have an idea?

Upvotes: 2

Views: 293

Answers (1)

Ben Bolker
Ben Bolker

Reputation: 226557

short answer: car::dataEllipse is using an F-statistic cutoff (analogous to a Student t-distribution cutoff, taking the uncertainty of the variance estimate into account). In contrast, you're using a cutoff based on a chi-squared distribution, analogous to Normal cutoffs, assuming the variance is known exactly (i.e., the sampling variance is equal to the true variance).

From that function:

radius <- sqrt(dfn * qf(level, dfn, dfd))
result[[i]] <- ellipse(center, shape, radius, ...)

where shape is the covariance matrix.

car:::ellipse uses

angles <- (0:segments) * 2 * pi/segments
unit.circle <- cbind(cos(angles), sin(angles))
Q <- chol(shape, pivot = TRUE)
order <- order(attr(Q, "pivot"))
ellipse <- t(center + radius * t(unit.circle %*% Q[, order]))

Upvotes: 1

Related Questions