Reputation: 387
When I plot densities with ggplot, it seems to be very wrong around the limits. I see that geom_density
and other functions allow specifying various density kernels, but none of them seem to fix the issue.
How do you correctly plot densities around the limits with ggplot?
As an example, let's plot the Chi-square distribution with 2 degrees of freedom. Using the builtin probability densities:
library(ggplot2)
u = seq(0, 2, by=0.01)
v = dchisq(u, df=2)
df = data.frame(x=u, p=v)
p = ggplot(df) +
geom_line(aes(x=x, y=p), size=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 0.5))
show(p)
We get the expected plot:
Now let's try simulating it and plotting the empirical distribution:
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_density(aes(x=x)) +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
We get an incorrect plot:
We can try to visualize the actual distribution:
library(ggplot2, dplyr, tidyr)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_point(aes(x=x, y=0.5), position=position_jitter(height=0.2), shape='.', alpha=1) +
theme_classic() +
coord_cartesian(xlim=c(0, 2), ylim=c(0, 1))
show(p)
And it seems to look correct, contrary to the density plot:
It seems like the problem has to do with kernels, and geom_density
does allow using different kernels. But they don't really correct the limit problem. For example, the code above with triangular
looks about the same:
Here's an idea of what I'm expecting to see (of course, I want a density, not a histogram):
library(ggplot2)
u = rchisq(10000, df=2)
df = data.frame(x=u)
p = ggplot(df) +
geom_histogram(aes(x=x), center=0.1, binwidth=0.2, fill='white', color='black') +
theme_classic() +
coord_cartesian(xlim=c(0, 2))
show(p)
Upvotes: 2
Views: 756
Reputation: 263342
The usual kernel density methods have trouble when there is a constraint such as in this case for a density with only support above zero. The usual recommendation for handling this has been to use the logspline package:
install.packages("logspline")
library(logspline)
png(); fit <- logspline(rchisq(10000, 3))
plot(fit) ; dev.off()
If this needed to be done in the ggplot2 environment there is a dlogspline function:
densdf <- data.frame( y=dlogspline(seq(0,12,length=1000), fit),
x=seq(0,12,length=1000))
ggplot(densdf, aes(y=y,x=x))+geom_line()
Perhaps you were insisting on one with 2 degrees of freedom?
Upvotes: 2