How to filter using grep

Question

I have a taxonomy file which is structured like this:

Can I use something like grep (I have no experience here) to remove elements from the taxon column?

For example, instead of:

D_0__Bacteria;D_1__Fusobacteria;D_2__Fusobacteriia;D_3__Fusobacteriales;D_4__Fusobacteriaceae;D_5__Fusobacterium

Could I remove everything before and after "Fusobacterium" so it only says:

Fusobacterium

Some of the rows go to species level so I would need to remove details after the 5th level of identification. For example:

Change:

D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272

To:

Haemophilus

user8617947 · Accepted Answer

This should do the trick:

sample <- "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272"

# The backslash must be doubled inside an R string literal to get the backreference \1
sub(".*D_5__([A-Za-z]*);.*", "\\1", sample)
# [1] "Haemophilus"

Explanation

The pattern matches the whole string and captures the run of letters between D_5__ and the next ;. The replacement is a backreference to that captured group, so sub() replaces the entire string with just the captured name.
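Note that the pattern above requires a ; after the level-5 name, so a row that ends at D_5__ (like the Fusobacterium example in the question) would be returned unchanged. A minimal sketch of one way to handle both cases on a whole column follows. It assumes a data frame named taxonomy (a hypothetical name) with the taxon column from the question, and captures everything up to the next semicolon or the end of the string:

```r
# Hypothetical data frame: the name `taxonomy` and the new `genus` column are
# assumptions for illustration; the two strings are the examples from the question.
taxonomy <- data.frame(
  taxon = c(
    "D_0__Bacteria;D_1__Fusobacteria;D_2__Fusobacteriia;D_3__Fusobacteriales;D_4__Fusobacteriaceae;D_5__Fusobacterium",
    "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pasteurellales;D_4__Pasteurellaceae;D_5__Haemophilus;D_6__Pasteurellaceae bacterium canine oral taxon 272"
  ),
  stringsAsFactors = FALSE
)

# [^;]* captures everything after D_5__ up to the next semicolon or the end
# of the string, so a trailing ; is no longer required.
# sub() is vectorised, so it processes the whole column in one call.
taxonomy$genus <- sub(".*D_5__([^;]*).*", "\\1", taxonomy$taxon)

taxonomy$genus
# [1] "Fusobacterium" "Haemophilus"
```

Rows that have no D_5__ level at all do not match and are left as they were, so it may be worth checking for those separately, for example with grepl("D_5__", taxonomy$taxon).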
