Reputation: 728
I am trying to solve a seemingly simple problem in R, but I cannot find a way to do it with gsub(), str_match(), or other regex-related functions. Can anyone help me with this?
Problem: Suppose I have a character vector of some length (say, 100). Each element has the form [string]_[string+number]_[someinfo]. I want to extract only the first part of each element, namely [string]_[string+number]. The number of characters in [string]_[string+number], not counting the _, could be anywhere between 8 and 20, but there is no fixed length. How can I do this with a regex in R?
x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')
Desired output:
x1 = c('XY_ABCD101', 'XZ_ACC122', 'XT_AAEEE100', 'XKY_BBAAUUU124')
Upvotes: 2
Reputation: 163352
You might use a pattern that asserts 9-21 word chars to the right (including the first underscore), and then matches the first 2 parts joined by a single underscore:
^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+
Explanation
^ : Start of string
(?= : Positive lookahead, assert that what is to the right of the current location is
\\w{9,21}_[A-Z0-9] : Match 9-21 word chars followed by an underscore and a char A-Z or a digit
) : Close the lookahead
[A-Z]+ : Match 1+ chars A-Z
_ : Match the first underscore
[A-Z0-9]+ : Match 1+ chars A-Z or a digit
x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')
regmatches(x, regexpr("^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+", x, perl = TRUE))
Output
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
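Note that regmatches() drops elements that do not match, so the result can be shorter than x. Here is a minimal sketch of one way to keep the output aligned with the input, assuming NA is acceptable for non-matching elements. The vector y and its third element are made up for illustration and are not from the question:

```r
# Hypothetical input: the last element deliberately does not fit the pattern
y <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'no match here')
pattern <- "^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+"

# regexpr() returns -1 for elements without a match
m <- regexpr(pattern, y, perl = TRUE)

# Start with NA everywhere, then fill in only the matching positions
out <- rep(NA_character_, length(y))
out[m != -1] <- regmatches(y, m)
out
```

This way out always has the same length as y, which is convenient if the result goes back into a data frame column.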
Upvotes: 2
Reputation: 101373
Since your intended output always ends at the first digit that is immediately followed by _, you can use the pattern (?<=\\d)(?=_) to find that position and remove the characters that follow:
> gsub("(?<=\\d)(?=_).*$","",x,perl = TRUE)
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
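To see what the lookarounds are doing, here is a small sketch that locates the zero-width position with regexpr() and then cuts the string there with substr(). The names pos and x1 are only illustrative, and the sketch assumes every element contains a digit followed by _:

```r
x <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN',
       'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

# Zero-width match: the position right after a digit and right before "_"
pos <- regexpr("(?<=\\d)(?=_)", x, perl = TRUE)

# Keep everything up to (but not including) that position
x1 <- substr(x, 1, pos - 1)
x1
```

The gsub() call above does the same thing in one step by matching from that position to the end of the string and replacing it with "".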
Upvotes: 1
Reputation: 887118
An option with str_remove from stringr, which removes everything from the first underscore that is followed by a digit:
library(stringr)
str_remove(x, "_\\d+.*")
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
Upvotes: 2
Reputation: 487
library(stringr)
str_extract(x, "[:alnum:]+_[:alnum:]+(?=_)")
[1] "XY_ABCD101" "XZ_ACC122"
[3] "XT_AAEEE100" "XKY_BBAAUUU124"
Upvotes: 3
Reputation: 78927
We could use str_extract from the stringr package with a regex that matches everything before the second underscore:
library(stringr)
str_extract(x, "[^_]*_[^_]*")
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
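If you would rather not load stringr, roughly the same idea can be sketched in base R with sub() and a capture group. This is a swapped-in alternative rather than the code from this answer; elements without any underscore would be returned unchanged:

```r
x <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN',
       'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

# Capture the first two underscore-separated parts, drop the rest
x1 <- sub("^([^_]*_[^_]*).*$", "\\1", x)
x1
```

Because [^_]* cannot cross an underscore, the capture group stops right before the second one, so no lookaround or perl = TRUE is needed.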
Upvotes: 4
Reputation: 4425
Try this
regmatches(x , regexpr("\\D+_\\D+\\d+" , x))
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100"
[4] "XKY_BBAAUUU124"
Upvotes: 2