Reputation: 5719
I have strings that look like this:
tt <- c("16S_M_T1_R1_S1_S50_R1_001.fastq.gz", "16S_M_T1_R1_S1_S50_R2_001.fastq.gz",
"16S_M_T1_R1_S2_S62_R1_001.fastq.gz")
I want to delete everything before the 5th _ and everything after the 6th _.
The result I want is:
S50, S50, S62
I can do this in multiple steps by doing something like sub("^(.*?_.*?_.*?_.*?_.*?_.*?)_.*", "\\1", tt), but I was wondering if there is a better one-step method to do this.
Upvotes: 2
Views: 631
Reputation: 887118
We can use sub by placing an anchor for the start (^), followed by 5 repetitions of one or more characters that are not a _ and then a _ (([^_]+_){5}), and then capturing the next run of characters that are not a _ (([^_]+)). The trailing .* matches the rest of the string, so the whole string is replaced by the second capture group (\\2):
sub("^([^_]+_){5}([^_]+).*", "\\2", tt)
#[1] "S50" "S50" "S62"
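One thing to keep in mind with this approach: sub returns a string unchanged when the pattern does not match, so a file name with fewer than six fields would pass through silently. A minimal sketch of one way to guard against that, assuming you would rather get NA for such names (the second file name here is made up for illustration):

```r
# Hypothetical input: the second name has too few "_"-separated fields
tt2 <- c("16S_M_T1_R1_S1_S50_R1_001.fastq.gz", "too_short.fastq.gz")
pattern <- "^([^_]+_){5}([^_]+).*"

# Keep the 6th field where the pattern matches, NA otherwise
out <- ifelse(grepl(pattern, tt2), sub(pattern, "\\2", tt2), NA_character_)
out
```

Without the grepl check, the second element would come back as the full, unmodified file name.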
Upvotes: 3
Reputation: 50678
You could use strsplit:
sapply(strsplit(tt, "_"), "[[", 6)
#[1] "S50" "S50" "S62"
Explanation: We use vectorised strsplit to split tt on every "_", resulting in a list; sapply(..., "[[", 6) then extracts the 6th element from every list element.
Alternatively, you could use an explicit anonymous function:
sapply(strsplit(tt, "_"), function(x) x[6])
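The two forms above differ when a string has fewer than six fields: "[[" with index 6 raises a "subscript out of bounds" error, while x[6] returns NA. As a sketch of a slightly stricter variant (not part of the original answer), vapply can be used in place of sapply so the result is guaranteed to be a character vector, and fixed = TRUE treats "_" as a literal string rather than a regular expression:

```r
tt <- c("16S_M_T1_R1_S1_S50_R1_001.fastq.gz", "16S_M_T1_R1_S1_S50_R2_001.fastq.gz",
        "16S_M_T1_R1_S2_S62_R1_001.fastq.gz")

# Split on the literal "_" and take the 6th field of each piece;
# character(1) makes vapply fail loudly if any result is not a single string
res <- vapply(strsplit(tt, "_", fixed = TRUE), function(x) x[6], character(1))
res
#[1] "S50" "S50" "S62"
```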
Upvotes: 3