StatsBoy

Reputation: 55

ggplot draw multiple plots by levels of a variable

I have a sample dataset

d=data.frame(n=rep(c(1,1,1,1,1,1,2,2,2,3),2),group=rep(c("A","B"),each=20),stringsAsFactors = F)

I want to draw two separate histograms, one for each level of the group variable.

I tried the method suggested by @jenesaisquoi in a separate post, "Generating Multiple Plots in ggplot by Factor":

ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)+facet_wrap(~group)

Histogram output

It did the trick, but if you look closely, the proportions are wrong. Instead of calculating the proportion within each group, it calculated it over the whole data set. I want the proportion for the number 1 to be 0.6 in each group, not 0.3.

Then I tried the dplyr package, and it didn't even create two graphs; it ignored the group_by command. The proportion is right this time, though.

d%>%group_by(group)%>%ggplot(data=.)+geom_histogram(aes(x=n,y=..count../sum(..count..)),binwidth = 1)

dplyr output

Finally, I tried mapping group to color:

ggplot(data=d)+geom_histogram(aes(x=n,y=..count../sum(..count..),color=group),binwidth = 1)

But the result is far from ideal. I would accept a single plot, but with the bars side by side rather than on top of each other.

color=group output

In conclusion, I want to draw two separate histograms with correct proportions calculated within each group. If there is no easy way to do this, I can live with one graph but having the bins side by side, and with correct proportions for each group. In this example, number 1 should have 0.6 as its proportion.

Upvotes: 5

Views: 1292

Answers (2)

Luis

Reputation: 639

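Changing ..count../sum(..count..) to ..density.. gives you the desired proportion. Density is computed separately for each panel, and with binwidth = 1 it equals the within-group proportion.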

ggplot(data=d) + geom_histogram(aes(x=n, y=..density..), binwidth = 1) + facet_wrap(~group)

Upvotes: 2

AcademicDialysis

Reputation: 178

You actually have the separation of charts by variable correct! Especially with ggplot, you sometimes need to consider the scales of the graph separately from its shape. facet_wrap splits your data into panels regardless of scale, so it behaves the same no matter what your axes are. You could also try adding scale_y_log10() to the plot: the overall shape and style of the graph stay the same, and only the axes change.

What you actually need is a fix to your scales. Understandable - frequency plots can be confusing. ..count../sum(..count..) treats each bin as an independent unit, regardless of its value, and the sum in the denominator runs over every bin in the plot rather than over one panel, which is why you see 0.3 instead of 0.6. See a good explanation of this here: Show % instead of counts in charts of categorical variables
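The two numbers from the question can be reproduced without plotting. This is a minimal base R sketch, assuming the data frame `d` from the question; the names `counts`, `grand` and `within` are only illustrative.

```r
# Sample data from the question (n is recycled to 40 rows, 20 per group)
d <- data.frame(n = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2),
                group = rep(c("A", "B"), each = 20),
                stringsAsFactors = FALSE)

# Cross-tabulate group against value: rows are groups, columns are values of n
counts <- table(d$group, d$n)

# Each cell divided by the grand total (the proportion the facetted plot showed)
grand <- counts / sum(counts)

# Each cell divided by its own row total (the proportion the question asks for)
within <- prop.table(counts, margin = 1)

grand["A", "1"]   # 0.3
within["A", "1"]  # 0.6
```

The only difference between the two tables is the denominator: 40 rows overall versus 20 rows per group.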

What you want is ..density.., which is the count in each bin divided by the total count in that group and by the bin width. With binwidth = 1 that is exactly the within-group proportion. The difference from ..count../sum(..count..) is subtle in principle, but the important bit is that the scale of the x-axis matters. For an extreme case of this, see Normalizing y-axis in histograms in R ggplot to proportion, where tiny x-axis values produced huge densities.
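To see how the bin width enters, here is a small sketch using base R's hist() rather than ggplot2. Its bin edges are chosen by hand and will not match ggplot's exactly, but the density definition (count / (total count * bin width)) is the same. The names `a`, `h1` and `h2` are illustrative.

```r
# Sample data from the question
d <- data.frame(n = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2),
                group = rep(c("A", "B"), each = 20),
                stringsAsFactors = FALSE)

# Values for group A only
a <- d$n[d$group == "A"]

# Bins of width 1 centred on 1, 2, 3: density equals the proportion
h1 <- hist(a, breaks = seq(0.5, 3.5, by = 1), plot = FALSE)
h1$density  # 0.6 0.3 0.1

# Bins of width 0.5: same counts, but every density is doubled
h2 <- hist(a, breaks = seq(0.5, 3.5, by = 0.5), plot = FALSE)
h2$density  # 1.2 0.0 0.6 0.0 0.2 0.0
```

So ..density.. can only be read as a proportion when the bin width is 1; for other widths, multiply the density by the width.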

Your original code will still work, just substituting the aesthetics I described above.

ggplot(data=d)+geom_histogram(aes(x=n,y=..density..),binwidth = 1)+facet_wrap(~group)
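For the fallback mentioned in the question (one plot with the bars side by side), one possible approach is to compute the within-group proportions first and draw them with geom_col(). This is a sketch rather than the method from either answer. It treats n as a discrete value instead of binning it, and `props` is an illustrative name.

```r
library(ggplot2)

# Sample data from the question
d <- data.frame(n = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), 2),
                group = rep(c("A", "B"), each = 20),
                stringsAsFactors = FALSE)

# Within-group proportions, one row per (group, n) combination
props <- as.data.frame(
  prop.table(table(group = d$group, n = d$n), margin = 1),
  responseName = "prop"
)

# One panel, bars for A and B placed next to each other
ggplot(props, aes(x = n, y = prop, fill = group)) +
  geom_col(position = "dodge")
```

Because the proportions are computed before plotting, the y values do not depend on how ggplot normalises counts across panels or groups.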

If you're still confused about density, so are lots of people. Hadley Wickham wrote a long piece about it, which you can find here: http://vita.had.co.nz/papers/density-estimation.pdf

Upvotes: 0
