Reputation: 315
I have some data where most of the values (about 10 million in the real data) are close to zero, but there are a few outliers. I want to show the distribution with a histogram. The outliers matter for the content and analysis of the data, so they should be visible in the histogram. Using a logarithmic scale on the y-axis works quite well, but one problem remains: the y-axis now starts at 1, so bins with exactly one element are drawn with zero height ($\log_{10}(1) = 0$) and cannot be distinguished from empty bins. Additionally, I get a warning message about infinite values for the empty bins (which is correct, $\log(0) = -\infty$).
I made a little code example:
library(ggplot2)
set.seed(123)
data <- data.frame(x=c(abs(rnorm(10000)), 5.25, 5.5, 7.5))
ggplot(data, aes(x)) +
  geom_histogram(binwidth=1, boundary=0) +
  scale_y_log10()
The two outliers between 5 and 6 are well shown but the one at 7.5 cannot be distinguished from the two empty bins. How do I tell ggplot to start drawing the bins from a y-value smaller than 1?
PS: Does Stack Overflow not support MathJax for showing math?
Upvotes: 0
Views: 1324
Reputation: 21
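You could use scale_y_sqrt() as an alternative transformation. Unlike the logarithm, the square root is defined at 0 and maps 0 to 0, so the bars start at zero, a bin with a single element gets a visible height, and empty bins no longer produce infinite values: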
library(ggplot2)
set.seed(123)
data <- data.frame(x=c(abs(rnorm(10000)), 5.25, 5.5, 7.5))
ggplot(data, aes(x)) +
  geom_histogram(binwidth=1, boundary=0) +
  scale_y_sqrt()
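If you want to stay closer to a log scale, a pseudo-log transformation might also be worth trying. This is only a sketch and not part of the suggestion above: it assumes the scales package is installed and uses its pseudo_log_trans(), which is roughly linear near 0 and logarithmic for large values, so a count of 0 maps to 0 and a count of 1 maps to a small positive height. The break positions are an arbitrary choice for this example data.

```r
library(ggplot2)
library(scales)  # provides pseudo_log_trans()

set.seed(123)
data <- data.frame(x = c(abs(rnorm(10000)), 5.25, 5.5, 7.5))

# Pseudo-log y-axis: behaves like log10 for large counts but stays finite at 0,
# so single-element bins are visible and empty bins give no infinite values.
# Note: recent ggplot2 versions name this argument `transform` instead of `trans`.
p <- ggplot(data, aes(x)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  scale_y_continuous(
    trans = pseudo_log_trans(base = 10),
    breaks = c(0, 1, 10, 100, 1000, 10000)  # hand-picked for this example
  )
p
```

Compared with scale_y_sqrt(), this compresses the large bins more strongly, which may help when the bulk of the data is many orders of magnitude larger than the outlier bins.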
Upvotes: 2