I have the following dataframe (subset):
kingdom phylum class order family genus
1 Bacteria unknown unknown unknown unknown unknown
2 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus
3 Bacteria unknown unknown unknown unknown unknown
4 Bacteria Firmicutes Bacilli Bacillales Listeriaceae Listeria
5 Bacteria unknown unknown unknown unknown unknown
6 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus
7 unknown unknown unknown unknown unknown unknown
tax <- structure(list(kingdom = c("Bacteria", "Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "unknown", "Bacteria", "Bacteria",
"Bacteria"), phylum = c("unknown", "Firmicutes", "unknown", "Firmicutes",
"unknown", "Firmicutes", "unknown", "Firmicutes", "Firmicutes",
"Firmicutes"), class = c("unknown", "Bacilli", "unknown", "Bacilli",
"unknown", "Bacilli", "unknown", "Bacilli", "Bacilli", "Bacilli"
), order = c("unknown", "Bacillales", "unknown", "Bacillales",
"unknown", "Bacillales", "unknown", "Bacillales", "Bacillales",
"Bacillales"), family = c("unknown", "Bacillaceae", "unknown",
"Listeriaceae", "unknown", "Bacillaceae", "unknown", "Bacillaceae",
"Bacillaceae", "Staphylococcaceae"), genus = c("unknown", "Bacillus",
"unknown", "Listeria", "unknown", "Bacillus", "unknown", "Bacillus",
"Bacillus", "Staphylococcus"), species = c("uncultured bacterium",
"Bacillus subtilis", "unknown", "Listeria monocytogenes", "uncultured bacterium",
"Bacillus subtilis", "metagenome", "Bacillus subtilis", "Bacillus subtilis",
"Staphylococcus aureus")), row.names = c(NA, 10L), class = "data.frame")
cols <- colnames(tax)
Each row can be duplicated, so I'm counting the unique rows as follows and adding a relative frequency for each one:
library(dplyr)
df2 <- tax %>%
  group_by(.dots = cols) %>%
  summarise(counts = n()) %>%
  mutate(relative_abundance = counts / sum(counts))
> df2
# A tibble: 6 x 9
# Groups: kingdom, phylum, class, order, family, genus [5]
kingdom phylum class order family genus species counts relative_abunda…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl>
1 Bacteria Firmicu… Bacil… Bacilla… Bacillaceae Bacillus Bacillus sub… 4 1
2 Bacteria Firmicu… Bacil… Bacilla… Listeriace… Listeria Listeria mon… 1 1
3 Bacteria Firmicu… Bacil… Bacilla… Staphyloco… Staphyl… Staphylococc… 1 1
4 Bacteria unknown unkno… unknown unknown unknown uncultured b… 2 0.667
5 Bacteria unknown unkno… unknown unknown unknown unknown 1 0.333
6 unknown unknown unkno… unknown unknown unknown metagenome 1 1
I was expecting one row per unique taxon, each with its percentage (between 0 and 100) of the total. What's wrong?
For example, there are 4 Bacillus subtilis rows in total and the sum of counts is 10, so I expect 4/10*100 = 40%.
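You forgot to ungroup() after summarise. summarise only drops the last grouping variable (species), so the result is still grouped by kingdom through genus, i.e. sum(counts) gives the sum for each group rather than the overall total. Try this: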
df2 <- tax %>%
  group_by(.dots = cols) %>%
  summarise(counts = n()) %>%
  ungroup() %>%
  mutate(relative_abundance = counts / sum(counts))
Or more concise: use count instead of group_by + summarise + ungroup:
df2 <- tax %>%
  count(.dots = cols) %>%
  mutate(relative_abundance = n / sum(n))
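If you want to check the leftover grouping yourself, here is a minimal sketch. It assumes dplyr 1.0.0 or later and uses across(all_of(...)) with the .groups argument of summarise in place of .dots and ungroup(); that spelling is my substitution, not something from the question. tax_small and cols_small are hypothetical stand-ins that keep only the genus and species columns of tax so the example stays short.

```r
library(dplyr)

# Hypothetical toy stand-in for `tax`: same 10 rows, only genus and species kept.
tax_small <- data.frame(
  genus = c("unknown", "Bacillus", "unknown", "Listeria", "unknown",
            "Bacillus", "unknown", "Bacillus", "Bacillus", "Staphylococcus"),
  species = c("uncultured bacterium", "Bacillus subtilis", "unknown",
              "Listeria monocytogenes", "uncultured bacterium",
              "Bacillus subtilis", "metagenome", "Bacillus subtilis",
              "Bacillus subtilis", "Staphylococcus aureus"),
  stringsAsFactors = FALSE
)
cols_small <- colnames(tax_small)

# "drop_last" is what summarise does by default here: only the last
# grouping variable (species) is dropped, so the result stays grouped by genus.
still_grouped <- tax_small %>%
  group_by(across(all_of(cols_small))) %>%
  summarise(counts = n(), .groups = "drop_last")
leftover <- group_vars(still_grouped)  # grouping variables that survive summarise

# Dropping all groups inside summarise makes sum(counts) the overall total.
df2_small <- tax_small %>%
  group_by(across(all_of(cols_small))) %>%
  summarise(counts = n(), .groups = "drop") %>%
  mutate(relative_abundance = counts / sum(counts),
         percent = 100 * relative_abundance)
```

Printing leftover shows which columns sum(counts) would still be computed within, and the percent column in df2_small is the 0 to 100 scale asked for in the question.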
Stefan