Reputation: 121
I am working in R and trying to extract part of a character string separated by underscores, including an underscore:
WRAP_384_p1_QC1_8
WRAP_384_p3_QC1_7
I wish to obtain an output like this:
1_QC1
3_QC1
What regex do I need to extract this information?
Upvotes: 1
Views: 4880
Reputation: 887118
We can use gsub to match zero or more characters (.*) followed by a _ and a lower case letter ([a-z]), or (|) a _ followed by one or more digits (\\d+) at the end ($) of the string, and replace each match with an empty string ("").
gsub(".*_[a-z]|_\\d+$", "", str1)
#[1] "1_QC1" "3_QC1"
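To make the alternation easier to follow, here is a minimal sketch that applies the two branches one at a time and then together. The names step1, step2 and combined are introduced only for illustration, and the input is the sample data from the question.

```r
# Sample input from the question
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")

# Branch 1: the greedy .* consumes everything up to the last "_" that is
# followed by a lower case letter, so "WRAP_384_p" is removed
step1 <- sub(".*_[a-z]", "", str1)

# Branch 2: remove the trailing "_" plus digits
step2 <- sub("_\\d+$", "", step1)

# Both branches in a single gsub call, as in the answer
combined <- gsub(".*_[a-z]|_\\d+$", "", str1)

print(step2)
print(combined)
```

gsub is needed in the combined form because each string contains two separate matches (the prefix and the suffix); sub would remove only the first one.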
Or use sub with capture groups. From the start (^) of the string, we match two instances of one or more non-underscore characters followed by an underscore (([^_]+_){2}), then a lower case letter ([a-z]). The second capture group ((...)) holds one or more digits (\\d+) followed by _ and one or more alphanumeric characters ([[:alnum:]]+). After the group come an underscore (_) and one or more digits (\\d+). We replace the whole match with the second capture group (\\2).
sub("^([^_]+_){2}[a-z](\\d+_[[:alnum:]]+)_\\d+", "\\2", str1)
#[1] "1_QC1" "3_QC1"
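As a rough way to check what the second capture group holds, the sketch below uses base R's regexec and regmatches to pull the groups out directly. The names pat, m, second_group and the extra "no_match" input are assumptions added for illustration; they are not part of the original answer.

```r
# Sample input from the question
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")

# Same pattern as in the sub() call
pat <- "^([^_]+_){2}[a-z](\\d+_[[:alnum:]]+)_\\d+"

# regmatches() with regexec() returns, per string, the full match
# followed by each capture group; element 3 is the second group
m <- regmatches(str1, regexec(pat, str1))
second_group <- vapply(m, function(x) x[3], character(1))
print(second_group)

# sub() leaves strings that do not match the pattern unchanged, so it is
# worth checking for unexpected inputs ("no_match" is a made-up example)
res <- sub(pat, "\\2", c(str1, "no_match"))
print(res)
```

Because sub returns non-matching strings as they are, unexpected formats pass through silently; comparing the result against the input is one way to spot them.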
Data used in the examples above:
str1 <- c("WRAP_384_p1_QC1_8", "WRAP_384_p3_QC1_7")
Upvotes: 6