user248237

Reputation:

normalizing ggplot2 densities with facet_wrap in R

I am making a series of density plots with geom_density from a dataframe, and showing it by condition using facet_wrap, as in:

ggplot(iris) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)

When I do this, the y-axis scale seems to not represent percent of each Species in a panel, but rather the percent of all the total datapoints across all species.

My question is: How can I make it so the ..count.. variable in geom_density refers to the count of items in each Species set of each panel, so that the panel for virginica has a y-axis corresponding to "Fraction of virginica data points"?

Also, is there a way to get ggplot2 to output the values it uses for ..count.. and sum(..count..) so that I can verify what numbers it is using?

Edit: I misunderstood geom_density. It looks like even for a single Species, ..count../sum(..count..) is not a percentage:

ggplot(iris[iris$Species == 'virginica',]) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)

So my revised question: how can I get the plot to show the fraction of data in each bin? Do I have to use stat_density or geom_histogram for this? I just want the y-axis to be the percentage / fraction of data points.

Upvotes: 5

Views: 8739

Answers (3)

a Data Head

Reputation: 61

Passing the argument scales='free_y' to facet_wrap() should do the trick.

Upvotes: 1

sc_evans

Reputation: 2902

Unfortunately, what you are asking ggplot2 to do is define separate y's for each facet, which it syntactically cannot do AFAIK.

So, in response to your mentioning in the comment thread that you "just want a histogram fundamentally", I would suggest instead using geom_histogram or, if you're partial to lines instead of bars, geom_freqpoly:

ggplot(iris, aes(Sepal.Width, ..count..)) + 
  geom_histogram(aes(colour=Species, fill=Species), binwidth=.2) +
  geom_freqpoly(colour="black", binwidth=.2) +
  facet_wrap(~Species)

Note: geom_freqpoly works just as well in place of geom_histogram in my example above. I added both to one plot for the sake of brevity.
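
If what you want on the y-axis is the fraction of each species' points per bin rather than raw counts, here is a minimal sketch of one way to get it. It relies on the fact that, for equal-width bins, the density computed by stat_bin is count / (n * binwidth) within each group, so density * binwidth is the per-species fraction. I am assuming ggplot2 3.3.0 or later, where after_stat(density) is the newer spelling of ..density..; the 0.2 inside aes() has to match the binwidth argument.

```r
library(ggplot2)

# Fraction of each species' observations per bin:
# density * binwidth = count / (number of points in that group)
p <- ggplot(iris, aes(Sepal.Width)) +
  geom_histogram(aes(y = after_stat(density * 0.2), fill = Species),
                 binwidth = 0.2) +
  facet_wrap(~Species) +
  ylab("Fraction of data points within species")

# ggplot_build() exposes the values ggplot2 computed for the layer,
# which is one way to verify the numbers behind the plot
built <- ggplot_build(p)$data[[1]]
panel_totals <- tapply(built$y, built$PANEL, sum)  # one total per facet
```

Printing p draws the plot, and panel_totals should come out as 1 for every facet if the heights really are per-species fractions. The built data frame also holds the count and density columns, so you can inspect exactly what was plotted.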

Hope this helps.

EDIT: Alright, I managed to work out a quick-and-dirty way of getting what you want. It requires that you install and load plyr. Apologies in advance; this is likely not the most efficient way to do this in terms of RAM usage, but it works.

First, let's get iris out in the open (I use RStudio so I'm used to seeing all my objects in a window):

d <- iris

Now, we can use ddply to count the number of individuals belonging to each unique measurement of what will become your x-axis (here I used Sepal.Length instead of Sepal.Width, to give myself a bit more range, simply for seeing a bigger difference between groups when plotted).

new <- ddply(d, c("Species", "Sepal.Length"), summarize, count=length(Sepal.Length))

Note that ddply automatically sorts the output data.frame according to the quoted variables.

Then we can divvy up the data.frame into each of its unique conditions--in the case of iris, each of the three species (I'm sure there's a much smoother way to go about this, and if you're working with really large amounts of data it's not advisable to keep creating subsets of the same data.frame because you could max out your RAM)...

set <- new[which(new$Species%in%"setosa"),]
ver <- new[which(new$Species%in%"versicolor"),]
vgn <- new[which(new$Species%in%"virginica"),]

... and use ddply again to calculate proportions of individuals falling under each measurement, but separately for each species.

prop <- rbind(ddply(set, c("Species"), summarize, prop=set$count/sum(set$count)),
              ddply(ver, c("Species"), summarize, prop=ver$count/sum(ver$count)),
              ddply(vgn, c("Species"), summarize, prop=vgn$count/sum(vgn$count)))

Then we just put everything we need into one dataset and remove all the junk from our workspace.

new$prop <- prop$prop
# remove only the intermediate objects created above, leaving anything else in the workspace alone
rm(set, ver, vgn, prop)

And we can make our figure with facet-specific proportions on the y. Note that I'm now using geom_line since ddply has automatically ordered your data.frame.

ggplot(new, aes(Sepal.Length, prop)) + 
  geom_line(aes(colour=Species)) +
  facet_wrap(~Species)

facet_wrap with facet-specific proportions

# let's check our work. each should equal 50
sum(new$count[which(new$Species%in%"setosa")]) 
sum(new$count[which(new$Species%in%"versicolor")]) 
sum(new$count[which(new$Species%in%"virginica")])

#... and each of these should equal 1
sum(new$prop[which(new$Species%in%"setosa")]) 
sum(new$prop[which(new$Species%in%"versicolor")]) 
sum(new$prop[which(new$Species%in%"virginica")])
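
For what it's worth, the same per-species proportions can probably be computed without splitting the data.frame by hand. The sketch below is a base R variant of the steps above, not the plyr approach itself: aggregate() does the counting and ave() divides each count by its own species total. The object name counts and the helper column count are just names I picked for illustration.

```r
# Count rows for each Species x Sepal.Length combination
counts <- aggregate(count ~ Species + Sepal.Length,
                    data = data.frame(iris[c("Species", "Sepal.Length")], count = 1),
                    FUN = sum)

# Divide each count by the total for its own species
counts$prop <- ave(counts$count, counts$Species, FUN = function(x) x / sum(x))

# Put rows in plotting order (aggregate does not sort by Species first)
counts <- counts[order(counts$Species, counts$Sepal.Length), ]

# Checks: 50 flowers per species, proportions summing to 1 per species
species_totals <- tapply(counts$count, counts$Species, sum)
prop_totals <- tapply(counts$prop, counts$Species, sum)
```

The resulting counts data.frame can be passed to the same kind of plot, e.g. ggplot(counts, aes(Sepal.Length, prop)) + geom_line(aes(colour=Species)) + facet_wrap(~Species).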

Upvotes: 6

marbel

Reputation: 7714

Maybe using table() and barplot() you might be able to get what you need. I'm still not sure if this is what you are after...

barplot(table(iris[iris$Species == 'virginica',1]))

With ggplot2

tb <- table(iris[iris$Species == 'virginica',1])
tb <- as.data.frame(tb)
ggplot(tb, aes(x=Var1, y=Freq)) + geom_bar(stat="identity")
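
If you want fractions instead of counts on the y-axis, a small variation that might do it is to wrap the table in prop.table() before plotting. This is only a sketch under my reading of the question; the column names Sepal.Length and fraction are ones I assign here for readability.

```r
library(ggplot2)

# Sepal.Length values for one species (column 1 of iris is Sepal.Length)
x <- iris$Sepal.Length[iris$Species == "virginica"]

# prop.table() turns the counts from table() into fractions that sum to 1
tb <- as.data.frame(prop.table(table(x)))
names(tb) <- c("Sepal.Length", "fraction")

# Bars whose heights are the fraction of virginica points at each value
p <- ggplot(tb, aes(x = Sepal.Length, y = fraction)) +
  geom_bar(stat = "identity") +
  ylab("Fraction of virginica data points")
```

Printing p draws the bar chart; sum(tb$fraction) is a quick way to confirm the heights add up to 1.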

Upvotes: 0
