Reputation: 2583
I have row.names that look like this:
Input:
S1_S2_S3_S9_AAACTGATFSRYB
S3_S4_S12_S1_TTTTTTGATFSRYB
S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB
and I would like the following:
S9_AAACTGATFSRYB
S1_TTTTTTGATFSRYB
S22_GTGTTTGATFSRYB
In other words, I would like to keep only the last S* token before the letters start. I have 6000 rows in total.
Can anyone please help me write a gsub call (or something similar) to extract the string I need?
Upvotes: 3
Views: 274
Reputation: 2797
Try this:
a <- c(
"S1_S2_S3_S9_AAACTGATFSRYB",
"S3_S4_S12_S1_TTTTTTGATFSRYB",
"S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
)
gsub("^.*_(.*_.+)$","\\1",a)
#> [1] "S9_AAACTGATFSRYB" "S1_TTTTTTGATFSRYB" "S22_GTGTTTGATFSRYB"
Created on 2018-07-18 by the reprex package (v0.2.0.9000).
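Edit: explanation of the regex:
^.*_ greedily matches from the start of the string (^) up to an underscore. On its own it would run to the last underscore, but the group that follows needs an underscore of its own, so it backtracks to the second-to-last one.
(.*_.+) matches the rest of the string: the last S* token, the last underscore, and the letters after it, which is what we want.
The parentheses () capture that part, and \1 in the replacement refers to it:
The backreference \N, where N = 1 ... 9, matches the substring previously matched by the Nth parenthesized subexpression of the regular expression.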
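To apply the same pattern to actual row names, something like the following minimal sketch might work. It assumes a data frame called df (a hypothetical name, not from the question) whose row names follow the format above. sub is used instead of gsub here because the anchored pattern can match at most once per string, so the result is the same.

```r
# Hypothetical data frame standing in for the real 6000-row object.
df <- data.frame(
  value = 1:3,
  row.names = c(
    "S1_S2_S3_S9_AAACTGATFSRYB",
    "S3_S4_S12_S1_TTTTTTGATFSRYB",
    "S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
  )
)

# Keep only the last "S*" token and the letters that follow it.
new_names <- sub("^.*_(.*_.+)$", "\\1", rownames(df))

# Row names of a data frame must be unique, so check before assigning.
if (anyDuplicated(new_names) == 0) {
  rownames(df) <- new_names
}
rownames(df)
```

Names with fewer than two underscores do not match the pattern and are returned unchanged by sub.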
Upvotes: 5
Reputation: 11480
An alternative using a regex with stringr:
stringr::str_extract(a,"[^_]+_[^_]+$")
#[1] "S9_AAACTGATFSRYB" "S1_TTTTTTGATFSRYB" "S22_GTGTTTGATFSRYB"
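If stringr is not available, the same extraction can probably be done in base R with regexpr and regmatches. This is a sketch using the same pattern; note that, unlike str_extract, regmatches silently drops elements that do not match instead of returning NA.

```r
a <- c(
  "S1_S2_S3_S9_AAACTGATFSRYB",
  "S3_S4_S12_S1_TTTTTTGATFSRYB",
  "S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
)

# Locate the last two underscore-separated fields in each string ...
m <- regexpr("[^_]+_[^_]+$", a)

# ... and pull out the matched text.
res <- regmatches(a, m)
res
```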
Upvotes: 0
Reputation: 56149
Non regex solution:
sapply(strsplit(a, "_"), function(i) paste(tail(i, n = 2), collapse = "_"))
# [1] "S9_AAACTGATFSRYB" "S1_TTTTTTGATFSRYB" "S22_GTGTTTGATFSRYB"
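A slightly stricter variant of the split approach, as a sketch: vapply declares that each result is a single string, so an unexpected input raises an error instead of silently changing the return type, and fixed = TRUE treats "_" as a literal separator instead of a regex. The helper name last_two is made up for illustration.

```r
a <- c(
  "S1_S2_S3_S9_AAACTGATFSRYB",
  "S3_S4_S12_S1_TTTTTTGATFSRYB",
  "S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
)

# Hypothetical helper: join the last two fields of one split string.
last_two <- function(parts) paste(utils::tail(parts, n = 2), collapse = "_")

# Split on literal underscores, then apply the helper to every element.
res <- vapply(strsplit(a, "_", fixed = TRUE), last_two, character(1))
res
```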
Upvotes: 1