Panda
Panda

Reputation: 155

Intersection between density plots of multiple groups

I'm using ggplot / easyGgplot2 to create density plots of two groups. I would like have a metric or indication of how much intersection there is between the two curves. I might even use any other solution without the curves, as long as it allows me to have a measure of which groups are more distinct (of several different groups of data).

Is there any easy way to do this in R?

For example using this sample, which generates this plot

enter image description here

How can I estimate the percentage of area that is common to both?

ggplot2.density(data=weight, xName='weight', groupName='sex',
    legendPosition="top",
    alpha=0.5, fillGroupDensity=TRUE )

Upvotes: 3

Views: 2119

Answers (2)

David.
David.

Reputation: 131

I like the previous answer, but this may be a bit more intuitive, also I made sure to use a common bandwidth:

library ( "caTools" )

# Extract common bandwidth
Bw <- ( density ( iris$Petal.Width ))$bw

# Get iris data
Sample <- with ( iris, split ( Petal.Width, Species ))[ 2:3 ]

# Estimate kernel densities using common bandwidth
Densities <- lapply ( Sample, density,
                      bw = bw,
                      n = 512,
                      from = -1,
                      to = 3 )

# Plot
plot( Densities [[ 1 ]], xlim = c ( -1, 3 ),
      col = "steelblue",
      main = "" )
lines ( Densities [[ 2 ]], col = "orange" )

# Overlap
X <- Densities [[ 1 ]]$x
Y1 <- Densities [[ 1 ]]$y
Y2 <- Densities [[ 2 ]]$y

Overlap <- pmin ( Y1, Y2 )
polygon ( c ( X, X [ 1 ]), c ( Overlap, Overlap [ 1 ]),
    lwd = 2, col = "hotpink", border = "n", density = 20) 

# Integrate
Total <- trapz ( X, Y1 ) + trapz ( X, Y2 )
(Surface <- trapz ( X, Overlap ) / Total)
SText <- paste ( sprintf ( "%.3f", 100*Surface ), "%" )
text ( X [ which.max ( Overlap )], 1.2 * max ( Overlap ), SText )

Overlap of densities of versicolor and virginica petal widths

Upvotes: 4

Rorschach
Rorschach

Reputation: 32466

First, make some data to use. Here, we will look at the Petal Width of two plant species from the built-in iris dataset.

## Some sample data from iris
dat <- droplevels(with(iris, iris[Species %in% c("versicolor", "virginica"), ]))

## make a similar graph
library(ggplot2)
ggplot(dat, aes(Petal.Width, fill=Species)) +
  geom_density(alpha=0.5)

enter image description here

To find the area of the intersection, you can use approxfun to approximate the function describing the overlap. Then, integrate it the get the area. Since these are density curves, their area is 1 (ish) so the integral will be the percentage overlap.

## Get density curves for each species
ps <- lapply(split(dat, dat$Species), function(x) {
    dens <- density(x$Petal.Width)
    data.frame(x=dens$x, y=dens$y)
})

## Approximate the functions and find intersection
fs <- sapply(ps, function(x) approxfun(x$x, x$y, yleft=0, yright=0))
f <- function(x) fs[[1]](x) - fs[[2]](x)   # function to minimize (difference b/w curves)
meet <- uniroot(f, interval=c(1, 2))$root  # intersection of the two curves

## Find overlapping x, y values
ps1 <- is.na(cut(ps[[1]]$x, c(-Inf, meet)))
ps2 <- is.na(cut(ps[[2]]$x, c(Inf, meet)))
shared <- rbind(ps[[1]][ps1,], ps[[2]][ps2,])

## Approximate function of intersection
f <- with(shared, approxfun(x, y, yleft=0, yright=0))

## have a look
xs <- seq(0, 3, len=1000)
plot(xs, f(xs), type="l", col="blue", ylim=c(0, 2))

points(ps[[1]], col="red", type="l", lty=2, lwd=2)
points(ps[[2]], col="blue", type="l", lty=2, lwd=2)

polygon(c(xs, rev(xs)), y=c(f(xs), rep(0, length(xs))), col="orange", density=40)

enter image description here

## Integrate it to get the value
integrate(f, lower=0, upper=3)$value
# [1] 0.1548127

Upvotes: 6

Related Questions