Reputation: 303
I have a taxonomy file which is structured like this:
Can I use something like grep (I have no experience here) to remove elements from the taxon column?
For example, instead of:
D_0__Bacteria;D_1__Fusobacteria;D_2__Fusobacteriia;D_3__Fusobacteriales;D_4__Fusobacteriaceae;D_5__Fusobacterium
Could I remove everything before and after "Fusobacterium" so it only says:
Fusobacterium
Some of the rows go to species level so I would need to remove details after the 5th level of identification. For example:
Change:
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272
To:
Haemophilus
Upvotes: 0
Views: 94
Reputation: 164
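This should do the trick:
sample <- "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272"
# Capture the letters right after D_5__; no trailing ";" is required,
# so strings that stop at the genus level are handled too.
sub(".*D_5__([A-Za-z]*).*", "\\1", sample)
# [1] "Haemophilus"
We are matching the whole string and capturing the run of letters that follows D_5__. The replacement "\\1" then tells sub() to return only the captured pattern. Because the pattern does not require a ; after the genus, it also works for a string that ends at D_5__Fusobacterium.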
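To apply the same idea to the whole taxon column rather than one string, something like the following might work. This is only a sketch: the data frame name tax, the column name taxon, and the helper extract_genus() are hypothetical, since the question does not show the actual file. It also swaps the letter class for [^;]* (everything up to the next semicolon), on the assumption that some genus labels may contain characters other than letters, and it returns NA for rows that never reach D_5__ instead of leaving the full string in place.

```r
# Hypothetical data frame standing in for the taxonomy file
tax <- data.frame(
  taxon = c(
    "D_0__Bacteria;D_1__Fusobacteria;D_2__Fusobacteriia;D_3__Fusobacteriales;D_4__Fusobacteriaceae;D_5__Fusobacterium",
    "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272",
    "D_0__Bacteria;D_1__Firmicutes"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the D_5 (genus) label, or NA if there is none
extract_genus <- function(x) {
  # Rows without a D_5__ field would otherwise come back unchanged from sub()
  has_genus <- grepl("D_5__", x, fixed = TRUE)
  # Capture everything after D_5__ up to the next ";" (or the end of the string)
  genus <- sub(".*D_5__([^;]*).*", "\\1", x)
  ifelse(has_genus, genus, NA_character_)
}

tax$genus <- extract_genus(tax$taxon)
print(tax$genus)
# [1] "Fusobacterium" "Haemophilus"   NA
```

sub() is vectorised over its input, so no loop is needed; the result can be written back to the taxon column instead of a new genus column if the original strings are no longer wanted.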
Upvotes: 1