Reputation: 155
I'm using ggplot
/ easyGgplot2
to create density plots of two groups. I would like have a metric or indication of how much intersection there is between the two curves. I might even use any other solution without the curves, as long as it allows me to have a measure of which groups are more distinct (of several different groups of data).
Is there any easy way to do this in R?
For example using this sample, which generates this plot
How can I estimate the percentage of area that is common to both?
ggplot2.density(data=weight, xName='weight', groupName='sex',
legendPosition="top",
alpha=0.5, fillGroupDensity=TRUE )
Upvotes: 3
Views: 2119
Reputation: 131
I like the previous answer, but this may be a bit more intuitive, also I made sure to use a common bandwidth:
library ( "caTools" )
# Extract common bandwidth
Bw <- ( density ( iris$Petal.Width ))$bw
# Get iris data
Sample <- with ( iris, split ( Petal.Width, Species ))[ 2:3 ]
# Estimate kernel densities using common bandwidth
Densities <- lapply ( Sample, density,
bw = bw,
n = 512,
from = -1,
to = 3 )
# Plot
plot( Densities [[ 1 ]], xlim = c ( -1, 3 ),
col = "steelblue",
main = "" )
lines ( Densities [[ 2 ]], col = "orange" )
# Overlap
X <- Densities [[ 1 ]]$x
Y1 <- Densities [[ 1 ]]$y
Y2 <- Densities [[ 2 ]]$y
Overlap <- pmin ( Y1, Y2 )
polygon ( c ( X, X [ 1 ]), c ( Overlap, Overlap [ 1 ]),
lwd = 2, col = "hotpink", border = "n", density = 20)
# Integrate
Total <- trapz ( X, Y1 ) + trapz ( X, Y2 )
(Surface <- trapz ( X, Overlap ) / Total)
SText <- paste ( sprintf ( "%.3f", 100*Surface ), "%" )
text ( X [ which.max ( Overlap )], 1.2 * max ( Overlap ), SText )
Upvotes: 4
Reputation: 32466
First, make some data to use. Here, we will look at the Petal Width of two plant species from the built-in iris
dataset.
## Some sample data from iris
dat <- droplevels(with(iris, iris[Species %in% c("versicolor", "virginica"), ]))
## make a similar graph
library(ggplot2)
ggplot(dat, aes(Petal.Width, fill=Species)) +
geom_density(alpha=0.5)
To find the area of the intersection, you can use approxfun
to approximate the function describing the overlap. Then, integrate it the get the area. Since these are density curves, their area is 1 (ish) so the integral will be the percentage overlap.
## Get density curves for each species
ps <- lapply(split(dat, dat$Species), function(x) {
dens <- density(x$Petal.Width)
data.frame(x=dens$x, y=dens$y)
})
## Approximate the functions and find intersection
fs <- sapply(ps, function(x) approxfun(x$x, x$y, yleft=0, yright=0))
f <- function(x) fs[[1]](x) - fs[[2]](x) # function to minimize (difference b/w curves)
meet <- uniroot(f, interval=c(1, 2))$root # intersection of the two curves
## Find overlapping x, y values
ps1 <- is.na(cut(ps[[1]]$x, c(-Inf, meet)))
ps2 <- is.na(cut(ps[[2]]$x, c(Inf, meet)))
shared <- rbind(ps[[1]][ps1,], ps[[2]][ps2,])
## Approximate function of intersection
f <- with(shared, approxfun(x, y, yleft=0, yright=0))
## have a look
xs <- seq(0, 3, len=1000)
plot(xs, f(xs), type="l", col="blue", ylim=c(0, 2))
points(ps[[1]], col="red", type="l", lty=2, lwd=2)
points(ps[[2]], col="blue", type="l", lty=2, lwd=2)
polygon(c(xs, rev(xs)), y=c(f(xs), rep(0, length(xs))), col="orange", density=40)
## Integrate it to get the value
integrate(f, lower=0, upper=3)$value
# [1] 0.1548127
Upvotes: 6