Reputation: 2322
I have a large data.table
(just showing five lines here).
taxpath N
Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48; 57
Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8; 54
Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA; 53
Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA; 41
Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84; 41
The first column is taxpath
(phylum, class, order, family, genus, species from left to right), the second column is N
, how often each taxpath occurs.
What I would like to do is to split each taxpath by the semicolon and use the first entry.
And I want to calculate how often each phylum rank (the first rank, so Bacteriodetes, Proteobacteria or Planctomycetes) occurs. However, this number should be multiplied with the value in the column N.
So, what I expect is more or less something like this.
phylum Nnew
Bacteriodetes 111
Proteobacteria 94
Planctomycetes 41
Can you help me how to do the split within the column and the - I suppose - group-by multiplied with the column N?
(PS: later, I want to do that as well with the other elements in the column taxpath as well, but I thought it is easier to distribute this into seperate tables)
Upvotes: 3
Views: 92
Reputation: 92282
This tagged data.table so here's a simple data.table solution.
library(data.table)
DT[, .(Nnew = sum(N)), by = sub(";.*", "", taxpath)]
# sub Nnew
# 1: Bacteroidetes 111
# 2: Proteobacteria 94
# 3: Planctomycetes 41
We basically summed N
while extracting the first part of taxpath
on the fly in the by
statement
Data
DT <- fread("taxpath\t N
Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;\t 57
Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;\t 54
Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;\t 53
Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;\t 41
Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;\t 41")
Upvotes: 2
Reputation: 886938
We can use separate
to split the 'taxpath' into the specified columns based on the delimiter ;
, grouped by 'phylum', get the sum
of 'N'
library(tidyverse)
newcols <-c("phylum", "class", "order", "family", "genus", "species")
df1 %>%
mutate(taxpath = sub(";$", "", taxpath)) %>%
separate(taxpath, into = newcols, sep= ";\\s*") %>%
group_by(phylum) %>%
summarise(Nnew = sum(N))
# A tibble: 3 x 2
# phylum Nnew
# <chr> <int>
# 1 Bacteroidetes 326
# 2 Planctomycetes 41
# 3 Proteobacteria 94
Upvotes: 1