mrad

Reputation: 303

How to filter using grep

I have a taxonomy file which is structured like this:

[image of the taxonomy file, not preserved]

Can I use something like grep (I have no experience here) to remove elements from the taxon column?

For example, instead of:

D_0__Bacteria;D_1__Fusobacteria;D_2__Fusobacteriia;D_3__Fusobacteriales;D_4__Fusobacteriaceae;D_5__Fusobacterium

Could I remove everything before and after "Fusobacterium" so that it only says:

Fusobacterium

Some of the rows go down to species level, so I would also need to remove everything after the 5th level of identification (D_5__). For example:

Change:

D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272

To:

Haemophilus

Upvotes: 0

Views: 94

Answers (1)

user8617947

Reputation: 164

This should do the trick:

sample <- "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272"

sub(".*D_5__([A-Za-z]*);.*", "\\1", sample)
# [1] "Haemophilus"

Explanation

The pattern matches the whole string and captures the run of letters between D_5__ and the following ;. Because the match covers the entire string, sub() replaces all of it with the captured group (\\1), so only the genus name is left.
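One caveat: the pattern above expects a ; after the genus, so a row that stops at D_5__ (like the Fusobacterium example in the question) would not match, and sub() would return it unchanged. A minimal sketch of a variant that handles both cases is below. It captures everything up to the next ; or the end of the string. The vector name taxa and the third, shorter row are made up for illustration, and turning rows with no D_5__ level into NA is just one possible choice.

```r
# Example rows: the first two come from the question; the third is a
# hypothetical row that stops before the D_5__ level.
taxa <- c(
  "D_0__Bacteria;D_1__Fusobacteria;D_2__Fusobacteriia;D_3__Fusobacteriales;D_4__Fusobacteriaceae;D_5__Fusobacterium",
  "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272",
  "D_0__Bacteria;D_1__Proteobacteria"
)

# [^;]* captures everything after D_5__ up to the next ";" or the end of
# the string, so the trailing ";" is no longer required.
genus <- sub(".*D_5__([^;]*).*", "\\1", taxa)

# sub() leaves non-matching strings untouched, so mark rows without a
# D_5__ level as missing (an assumption about how such rows should be handled).
genus[!grepl("D_5__", taxa, fixed = TRUE)] <- NA_character_

genus
# [1] "Fusobacterium" "Haemophilus"   NA
```

Since sub() is vectorised, the same call can be applied directly to a data frame column, for example a column named taxon (assumed from the question's description).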

Upvotes: 1
