Keith W. Larson

Reputation: 1573

R: How to aggregate data into percentages without missing data for stacked-bar plot in ggplot2?

I would like to summarize my "karyotype" molecular data by location and substrate (see sample data below) as percentages in order to create a stacked bar plot in ggplot2.

I have figured out how to use 'dcast' to get a total for each karyotype, but cannot figure out how to get a percent for each of the three karyotypes (i.e. 'BB', 'BD', 'DD').

The data should be in a format to make a stacked bar plot in 'ggplot2'.

Sample Data:

library(reshape2)
Karotype.Data <- structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L), .Label = c("Kampinge", "Kaseberga", "Molle", "Steninge"
), class = "factor"), Substrate = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
2L, 2L, 2L, 2L, 2L), .Label = c("Kampinge", "Kaseberga", "Molle", 
"Steninge"), class = "factor"), Karyotype = structure(c(1L, 3L, 
4L, 4L, 3L, 3L, 4L, 4L, 4L, 3L, 1L, 4L, 3L, 4L, 4L, 3L, 1L, 4L, 
3L, 3L, 4L, 3L, 4L, 3L, 3L), .Label = c("", "BB", "BD", "DD"), class = "factor")), .Names = c("Location", 
"Substrate", "Karyotype"), row.names = c(135L, 136L, 137L, 138L, 
139L, 165L, 166L, 167L, 168L, 169L, 236L, 237L, 238L, 239L, 240L, 
326L, 327L, 328L, 329L, 330L, 426L, 427L, 428L, 429L, 430L), class = "data.frame")

## Summary count for each karyotype ##
Karyotype.Summary <- dcast(Karotype.Data, Location + Substrate ~ Karyotype, value.var = "Karyotype", fun.aggregate = length)

Upvotes: 3

Views: 462

Answers (2)

Keith W. Larson

Reputation: 1573

With help from Marat Talipov's answer and several other Stack Overflow answers, I found that when both packages are in use, 'plyr' must be loaded before 'dplyr'; otherwise plyr's 'summarise' masks dplyr's and ignores the grouping. I also used 'summarise' rather than 'summarize', although dplyr treats the two spellings as synonyms. The last step was removing the missing data with 'filter'.

library(dplyr)
z.counts <- Karotype.Data %>% 
  group_by(Location,Substrate,Karyotype) %>% 
  summarise(freq=n()) 

z.freq <- z.counts %>% filter(Karyotype != '') %>% 
  group_by(Location,Substrate) %>% 
  mutate(freq=freq/sum(freq))
z.freq

library(ggplot2)
ggplot(z.freq, aes(x=Substrate, y=freq, fill=Karyotype)) +
  geom_bar(stat="identity") +
  facet_wrap(~ Location)

This produces the plot I was looking for.
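
On the load-order point above: one way to sidestep masking altogether is to call the dplyr verbs with an explicit namespace, so it no longer matters which package was attached last. This is only a minimal sketch; the 'toy' data frame and the 'toy_counts' and 'toy_freq' names are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical toy data, not the question's data set.
toy <- data.frame(
  Substrate = c("Molle", "Molle", "Molle", "Steninge"),
  Karyotype = c("BD", "DD", "DD", "BD")
)

# Count rows per Substrate/Karyotype. The dplyr:: prefix makes sure the
# dplyr verbs are used even if plyr was attached after dplyr.
toy_counts <- toy %>%
  dplyr::group_by(Substrate, Karyotype) %>%
  dplyr::summarise(freq = dplyr::n()) %>%
  dplyr::ungroup()

# Convert the counts to proportions within each Substrate.
toy_freq <- toy_counts %>%
  dplyr::group_by(Substrate) %>%
  dplyr::mutate(freq = freq / sum(freq)) %>%
  dplyr::ungroup()

toy_freq
```

Within each substrate the proportions should add up to 1, which is a quick check that the grouping was respected.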

Upvotes: 0

Marat Talipov

Reputation: 13304

You can use the dplyr package:

library(dplyr)
z.counts <- Karotype.Data %>% 
  group_by(Location,Substrate,Karyotype) %>% 
  summarize(freq=n()) 

z.freq <- z.counts %>% 
  group_by(Location,Substrate) %>% 
  mutate(freq=freq/sum(freq)*100)

Here, the data remain in the long format, so it is straightforward to build the barplot with ggplot:

library(ggplot2)
ggplot(z.freq) + 
  aes(x=Karyotype,y=freq) + 
  facet_grid(Location~Substrate) + 
  geom_bar(stat='identity')
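
If a stacked chart without the missing karyotypes is wanted, as the question asks, ggplot2 can also compute the shares itself: 'geom_bar' with 'position = "fill"' counts the raw rows and rescales each bar to a total of 1. The following is a rough sketch under that assumption, not the approach used above; the 'toy' data frame, 'toy_known', and the axis label are hypothetical stand-ins.

```r
library(ggplot2)

# Hypothetical toy data in the same shape as Karotype.Data (one row per sample).
toy <- data.frame(
  Location  = c("Kampinge", "Kampinge", "Kampinge", "Kampinge", "Kaseberga", "Kaseberga"),
  Substrate = c("Molle", "Molle", "Molle", "Steninge", "Molle", "Molle"),
  Karyotype = c("BD", "DD", "", "BD", "DD", "DD"),
  stringsAsFactors = FALSE
)

# Drop rows with a missing karyotype before plotting.
toy_known <- toy[toy$Karyotype != "", ]

# geom_bar() counts rows per Substrate/Karyotype; position = "fill"
# rescales every stacked bar so that it sums to 1.
p <- ggplot(toy_known, aes(x = Substrate, fill = Karyotype)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~ Location) +
  ylab("Share of samples")
```

Calling print(p) draws the chart. Because no pre-aggregated table is built, this route suits plotting only; the dplyr summary is still the better choice when the percentages themselves are needed.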


Upvotes: 1
