david

Reputation: 825

Relative frequency with dplyr unexpected output

I have the following dataframe (subset):

    kingdom     phylum   class      order            family          genus
1  Bacteria    unknown unknown    unknown           unknown        unknown
2  Bacteria Firmicutes Bacilli Bacillales       Bacillaceae       Bacillus
3  Bacteria    unknown unknown    unknown           unknown        unknown
4  Bacteria Firmicutes Bacilli Bacillales      Listeriaceae       Listeria
5  Bacteria    unknown unknown    unknown           unknown        unknown
6  Bacteria Firmicutes Bacilli Bacillales       Bacillaceae       Bacillus
7   unknown    unknown unknown    unknown           unknown        unknown




tax <- structure(list(kingdom = c("Bacteria", "Bacteria", "Bacteria", 
"Bacteria", "Bacteria", "Bacteria", "unknown", "Bacteria", "Bacteria", 
"Bacteria"), phylum = c("unknown", "Firmicutes", "unknown", "Firmicutes", 
"unknown", "Firmicutes", "unknown", "Firmicutes", "Firmicutes", 
"Firmicutes"), class = c("unknown", "Bacilli", "unknown", "Bacilli", 
"unknown", "Bacilli", "unknown", "Bacilli", "Bacilli", "Bacilli"
), order = c("unknown", "Bacillales", "unknown", "Bacillales", 
"unknown", "Bacillales", "unknown", "Bacillales", "Bacillales", 
"Bacillales"), family = c("unknown", "Bacillaceae", "unknown", 
"Listeriaceae", "unknown", "Bacillaceae", "unknown", "Bacillaceae", 
"Bacillaceae", "Staphylococcaceae"), genus = c("unknown", "Bacillus", 
"unknown", "Listeria", "unknown", "Bacillus", "unknown", "Bacillus", 
"Bacillus", "Staphylococcus"), species = c("uncultured bacterium", 
"Bacillus subtilis", "unknown", "Listeria monocytogenes", "uncultured bacterium", 
"Bacillus subtilis", "metagenome", "Bacillus subtilis", "Bacillus subtilis", 
"Staphylococcus aureus")), row.names = c(NA, 10L), class = "data.frame")


 cols <- colnames(tax)

Rows can be duplicated, so I'm counting the unique rows as follows and adding a relative frequency for each one:

 df2 <- tax %>% 
        group_by(.dots=cols) %>%
        summarise(counts = n())  %>%
        mutate(relative_abundance=( counts/sum(counts)))


> df2
# A tibble: 6 x 9
# Groups:   kingdom, phylum, class, order, family, genus [5]
  kingdom  phylum   class  order    family      genus    species       counts relative_abunda…
  <chr>    <chr>    <chr>  <chr>    <chr>       <chr>    <chr>          <int>            <dbl>
1 Bacteria Firmicu… Bacil… Bacilla… Bacillaceae Bacillus Bacillus sub…      4            1    
2 Bacteria Firmicu… Bacil… Bacilla… Listeriace… Listeria Listeria mon…      1            1    
3 Bacteria Firmicu… Bacil… Bacilla… Staphyloco… Staphyl… Staphylococc…      1            1    
4 Bacteria unknown  unkno… unknown  unknown     unknown  uncultured b…      2            0.667
5 Bacteria unknown  unkno… unknown  unknown     unknown  unknown            1            0.333
6 unknown  unknown  unkno… unknown  unknown     unknown  metagenome         1            1

I was expecting unique rows, each with its share of the total (bacteria or unknown) as a percentage between 0 and 100. What's wrong?

For example, there are 4 Bacillus subtilis rows in total and the counts sum to 10, so I expect 4/10*100 = 40%.

Upvotes: 0

Views: 28

Answers (1)

stefan

Reputation: 125228

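You forgot to ungroup() after summarise(). summarise() drops only the last grouping variable (species), so the result is still grouped by kingdom through genus, and sum(counts) gives the sum within each of those groups rather than over all rows. Try this: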

df2 <- tax %>% 
  group_by(.dots=cols) %>%
  summarise(counts = n())  %>%
  ungroup() %>% 
  mutate(relative_abundance = counts / sum(counts))

Or, more concisely, use count() instead of group_by() + summarise() + ungroup():

df2 <- tax %>% 
    count(.dots = cols)  %>%
    mutate(relative_abundance = n / sum(n))
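
For newer dplyr versions, where .dots is deprecated, the same idea can be sketched as below. This is only a sketch, assuming dplyr >= 1.0.0 (for across() and the .groups argument); the toy data frame and the names still_grouped, via_summarise and via_count are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Hypothetical stand-in for `tax`, reduced to two columns
toy <- data.frame(
  genus   = c("Bacillus", "Bacillus", "unknown", "unknown", "unknown"),
  species = c("Bacillus subtilis", "Bacillus subtilis",
              "uncultured bacterium", "uncultured bacterium", "unknown")
)
cols <- colnames(toy)

# summarise() peels off only the last grouping column ("species"), so the
# result is still grouped by "genus" and sum(counts) would be taken per genus
still_grouped <- toy %>%
  group_by(across(all_of(cols))) %>%
  summarise(counts = n(), .groups = "drop_last")
group_vars(still_grouped)  # "genus"

# Dropping all groups inside summarise() replaces the separate ungroup() step
via_summarise <- toy %>%
  group_by(across(all_of(cols))) %>%
  summarise(counts = n(), .groups = "drop") %>%
  mutate(relative_abundance = counts / sum(counts))

# count() on an ungrouped data frame returns an ungrouped result
via_count <- toy %>%
  count(across(all_of(cols)), name = "counts") %>%
  mutate(relative_abundance = counts / sum(counts))
```

In both via_summarise and via_count the relative_abundance column sums to 1 over the whole table; multiply by 100 if you want percentages.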

Stefan

Upvotes: 2
