Achal Neupane

Reputation: 5719

How to remove string before and after certain delimiter positions in R?

I have strings that look like this:

tt <- c("16S_M_T1_R1_S1_S50_R1_001.fastq.gz", "16S_M_T1_R1_S1_S50_R2_001.fastq.gz", 
"16S_M_T1_R1_S2_S62_R1_001.fastq.gz")

I want to delete everything before the 5th _ and everything after the 6th _. The result I want is: S50, S50, S62

I can do this in multiple steps, with something like sub("^(.*?_.*?_.*?_.*?_.*?_.*?)_.*", "\\1", tt) as one of them, but I was wondering whether there is a better one-step method.

Upvotes: 2

Views: 631

Answers (2)

akrun

Reputation: 887118

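We can use sub: anchor the match at the start of the string (^), match 5 repetitions of one or more non-underscore characters followed by a _ (([^_]+_){5}), and then capture the next run of non-underscore characters (([^_]+)). The trailing .* consumes the rest of the string. In the replacement, refer to the second capture group (\\2), because the repeated prefix is itself the first group.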

sub("^([^_]+_){5}([^_]+).*", "\\2", tt)
#[1] "S50" "S50" "S62"
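
The same idea can be written with a non-capturing group, so that the wanted field becomes the first and only capture group. This is a minimal sketch, assuming perl = TRUE is acceptable here; field_at is a hypothetical helper name introduced for illustration, not a base R function.

```r
tt <- c("16S_M_T1_R1_S1_S50_R1_001.fastq.gz",
        "16S_M_T1_R1_S1_S50_R2_001.fastq.gz",
        "16S_M_T1_R1_S2_S62_R1_001.fastq.gz")

# (?:...) groups without capturing, so the wanted field is \\1
res <- sub("^(?:[^_]+_){5}([^_]+).*", "\\1", tt, perl = TRUE)
res

# Hypothetical helper: return the n-th "_"-separated field (n >= 1)
field_at <- function(x, n) {
  # skip n - 1 fields, then capture the next one
  pattern <- sprintf("^(?:[^_]+_){%d}([^_]+).*", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}
field_at(tt, 6)
```

Note that sub returns its input unchanged when the pattern does not match, so a string with fewer than n fields would pass through unmodified rather than produce an error.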

Upvotes: 3

Maurits Evers

Reputation: 50678

You could use strsplit:

sapply(strsplit(tt, "_"), "[[", 6)
#[1] "S50" "S50" "S62"

Explanation: We use vectorised strsplit to split tt on every "_" resulting in a list; sapply(..., "[[", 6) then extracts the 6th element from every list element.

Alternatively, you could use an explicit anonymous function:

sapply(strsplit(tt, "_"), function(x) x[6])
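
If a guaranteed character vector is preferred, one possible variant uses vapply, which checks the type and length of each result. This is only a sketch of that alternative; fixed = TRUE is an assumption that the delimiter should be treated as a literal string rather than a regular expression.

```r
tt <- c("16S_M_T1_R1_S1_S50_R1_001.fastq.gz",
        "16S_M_T1_R1_S1_S50_R2_001.fastq.gz",
        "16S_M_T1_R1_S2_S62_R1_001.fastq.gz")

# Split each string on a literal "_"; the result is a list of character vectors
parts <- strsplit(tt, "_", fixed = TRUE)

# Take the 6th field of each element; vapply enforces one string per element
res <- vapply(parts, function(x) x[6], character(1))
res
```

With x[6], a string that has fewer than six fields yields NA, whereas "[[" would stop with a subscript error. Which behaviour is preferable depends on how malformed file names should be handled.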

Upvotes: 3
