Wassinger
Wassinger

Reputation: 387

How to make density plot correctly show area near the limits?

When I plot densities with ggplot, it seems to be very wrong around the limits. I see that geom_density and other functions allow specifying various density kernels, but none of them seem to fix the issue.

How do you correctly plot densities around the limits with ggplot?


As an example, let's plot the Chi-square distribution with 2 degrees of freedom. Using the builtin probability densities:

library(ggplot2)

u = seq(0, 2, by=0.01)
v = dchisq(u, df=2)

df = data.frame(x=u, p=v)

p = ggplot(df) +
    geom_line(aes(x=x, y=p), size=1) +
    theme_classic() +
    coord_cartesian(xlim=c(0, 2), ylim=c(0, 0.5))

show(p)

We get the expected plot:

enter image description here

Now let's try simulating it and plotting the empirical distribution:

library(ggplot2)

u = rchisq(10000, df=2)

df = data.frame(x=u)

p = ggplot(df) +
    geom_density(aes(x=x)) +
    theme_classic() +
    coord_cartesian(xlim=c(0, 2))

show(p)

We get an incorrect plot:

enter image description here

We can try to visualize the actual distribution:

library(ggplot2, dplyr, tidyr)

u = rchisq(10000, df=2)

df = data.frame(x=u)

p = ggplot(df) +
    geom_point(aes(x=x, y=0.5), position=position_jitter(height=0.2), shape='.', alpha=1) +
    theme_classic() +
    coord_cartesian(xlim=c(0, 2), ylim=c(0, 1))

show(p)

And it seems to look correct, contrary to the density plot:

enter image description here

It seems like the problem has to do with kernels, and geom_density does allow using different kernels. But they don't really correct the limit problem. For example, the code above with triangular looks about the same:

enter image description here

Here's an idea of what I'm expecting to see (of course, I want a density, not a histogram):

library(ggplot2)

u = rchisq(10000, df=2)

df = data.frame(x=u)

p = ggplot(df) +
    geom_histogram(aes(x=x), center=0.1, binwidth=0.2, fill='white', color='black') +
    theme_classic() +
    coord_cartesian(xlim=c(0, 2))

show(p)

enter image description here

Upvotes: 2

Views: 756

Answers (1)

IRTFM
IRTFM

Reputation: 263342

The usual kernel density methods have trouble when there is a constraint such as in this case for a density with only support above zero. The usual recommendation for handling this has been to use the logspline package:

install.packages("logspline")
library(logspline)
png(); fit <- logspline(rchisq(10000, 3))       
plot(fit) ; dev.off()

enter image description here

If this needed to be done in the ggplot2 environment there is a dlogspline function:

densdf <- data.frame( y=dlogspline(seq(0,12,length=1000), fit), 
                      x=seq(0,12,length=1000))

ggplot(densdf, aes(y=y,x=x))+geom_line()

Perhaps you were insisting on one with 2 degrees of freedom?

enter image description here

Upvotes: 2

Related Questions