Reputation: 2583

Subset a part of a string before a certain pattern

I have row.names that look like this:

Input:

 S1_S2_S3_S9_AAACTGATFSRYB
 S3_S4_S12_S1_TTTTTTGATFSRYB
 S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB

and I would like the following:

 S9_AAACTGATFSRYB
 S1_TTTTTTGATFSRYB
 S22_GTGTTTGATFSRYB

In other words, I would like to keep only the last S* token before the letter sequence starts. I have 6000 rows in total.

Can anyone please help me write a gsub call, or something similar, to extract the string I need?

Upvotes: 3

Answers (3)

TC Zhang

Reputation: 2797

Try this:

a <- c(
"S1_S2_S3_S9_AAACTGATFSRYB",
"S3_S4_S12_S1_TTTTTTGATFSRYB",
"S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
)


gsub("^.*_(.*_.+)$","\\1",a)
#> [1] "S9_AAACTGATFSRYB"   "S1_TTTTTTGATFSRYB"  "S22_GTGTTTGATFSRYB"

Created on 2018-07-18 by the reprex package (v0.2.0.9000).

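Edit: added an explanation of the regex:

^.*_ greedily matches from the start of the string (^) up to an underscore. Because the group that follows must itself contain an underscore, the engine backtracks until this part ends at the second-to-last underscore.
(.*_.+) captures the rest of the string: the last underscore together with the text on either side of it, which is the part we want.
() and \1: the parentheses define a capture group, and "\\1" in the replacement inserts the text that group captured.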

The backreference \N, where N = 1 ... 9, matches the substring previously matched by the Nth parenthesized subexpression of the regular expression.
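
To see the capture group at work, here is a minimal sketch using base R's regexec() and regmatches(), which return the full match followed by each parenthesised group. The variable names x, m and result are only illustrative and are not part of the answer above.

```r
x <- "S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"

# regexec() reports the full match plus each parenthesised group;
# regmatches() turns those positions into the matched text
m <- regmatches(x, regexec("^.*_(.*_.+)$", x))[[1]]

m[1]  # the whole match, which here is the entire string
m[2]  # group 1, the text that "\\1" refers to in the replacement

# The pattern is anchored at both ends, so it can match at most once
# per string; sub() therefore gives the same result as gsub() here
result <- sub("^.*_(.*_.+)$", "\\1", x)
```

Since the whole string is matched and replaced by group 1 alone, everything before the second-to-last underscore is dropped.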

Upvotes: 5

Andre Elrico

Reputation: 11480

An alternative using a regex and stringr:

stringr::str_extract(a,"[^_]+_[^_]+$")
#[1] "S9_AAACTGATFSRYB"   "S1_TTTTTTGATFSRYB"  "S22_GTGTTTGATFSRYB"
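
If stringr is not available, the same pattern should also work with base R's regexpr() and regmatches(). This is a sketch rather than a drop-in replacement: the name last_two is illustrative, and regmatches() silently drops elements that do not match, whereas str_extract() returns NA for them.

```r
a <- c(
  "S1_S2_S3_S9_AAACTGATFSRYB",
  "S3_S4_S12_S1_TTTTTTGATFSRYB",
  "S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
)

# regexpr() locates the first match in each element;
# regmatches() extracts the matched text
# [^_]+_[^_]+$ means: non-underscores, one underscore, non-underscores,
# then the end of the string, i.e. the last two "_"-separated fields
last_two <- regmatches(a, regexpr("[^_]+_[^_]+$", a))
```

Because of the dropped non-matches, it is worth checking that length(last_two) equals length(a) before assigning the result back.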

Upvotes: 0

zx8754

Reputation: 56149

Non-regex solution:

sapply(strsplit(a, "_"), function(i) paste(tail(i, n = 2), collapse = "_"))
# [1] "S9_AAACTGATFSRYB"   "S1_TTTTTTGATFSRYB"  "S22_GTGTTTGATFSRYB"
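
Since the question is about row names, here is a sketch of applying the same split-and-paste idea to a data frame. The data frame df and its value column are hypothetical stand-ins for the real data. Row names of a data frame must be unique, so make.unique() is used as a precaution in case two shortened names collide; whether that is the right way to handle collisions depends on the data.

```r
a <- c(
  "S1_S2_S3_S9_AAACTGATFSRYB",
  "S3_S4_S12_S1_TTTTTTGATFSRYB",
  "S9_S4_S12_S1_S2_S19_S22_GTGTTTGATFSRYB"
)

# Hypothetical data frame standing in for the real one
df <- data.frame(value = 1:3, row.names = a)

# Keep the last two "_"-separated fields of each row name;
# vapply() guarantees a character vector with one string per name
new_names <- vapply(
  strsplit(rownames(df), "_", fixed = TRUE),
  function(parts) paste(tail(parts, n = 2), collapse = "_"),
  character(1)
)

# make.unique() appends .1, .2, ... to repeated names and
# leaves names that are already unique unchanged
rownames(df) <- make.unique(new_names)
```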

Upvotes: 1
