Revan
Revan

Reputation: 2322

R/data.table: separate columns and count occurrences

I have a large data.table (just showing five lines here).

 taxpath                                                                                                                  N
                  Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;   57
 Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;   54
                Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;   53
                                                    Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;   41
                                                      Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;   41

The first column is taxpath (phylum, class, order, family, genus, species from left to right), the second column is N, how often each taxpath occurs.

What I would like to do is to split each taxpath by the semicolon and use the first entry.

And I want to calculate how often each phylum rank (the first rank, so Bacteriodetes, Proteobacteria or Planctomycetes) occurs. However, this number should be multiplied with the value in the column N.

So, what I expect is more or less something like this.

phylum           Nnew
Bacteriodetes    111
Proteobacteria   94
Planctomycetes   41

Can you help me how to do the split within the column and the - I suppose - group-by multiplied with the column N?

(PS: later, I want to do that as well with the other elements in the column taxpath as well, but I thought it is easier to distribute this into seperate tables)

Upvotes: 3

Views: 92

Answers (2)

David Arenburg
David Arenburg

Reputation: 92282

This tagged so here's a simple data.table solution.

library(data.table)
DT[, .(Nnew = sum(N)), by = sub(";.*", "", taxpath)]
#               sub Nnew
# 1:  Bacteroidetes  111
# 2: Proteobacteria   94
# 3: Planctomycetes   41

We basically summed N while extracting the first part of taxpath on the fly in the by statement


Data

DT <- fread("taxpath\t                                                                                                                  N
            Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;\t   57
            Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;\t   54
            Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;\t   53
            Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;\t   41
            Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;\t   41")

Upvotes: 2

akrun
akrun

Reputation: 886938

We can use separate to split the 'taxpath' into the specified columns based on the delimiter ;, grouped by 'phylum', get the sum of 'N'

library(tidyverse) 
newcols <-c("phylum", "class", "order", "family", "genus", "species")
df1 %>% 
   mutate(taxpath = sub(";$", "", taxpath)) %>%
   separate(taxpath, into = newcols, sep= ";\\s*") %>% 
   group_by(phylum) %>%
   summarise(Nnew = sum(N))
# A tibble: 3 x 2
#   phylum          Nnew
#   <chr>          <int>
# 1 Bacteroidetes    326
# 2 Planctomycetes    41
# 3 Proteobacteria    94

Upvotes: 1

Related Questions