Reputation: 728

Extract part of the strings with specific format

I am trying to solve a seemingly simple problem in R, but I cannot find a way to do it with gsub(), str_match(), or other regex-related functions. Can anyone help?

Problem: Assume I have a character vector of some length (say, 100). Each element has the form [string]_[string+number]_[someinfo]. I want to extract only the first part of each element, namely [string]_[string+number]. The number of characters in [string]_[string+number], not counting the _, can be anywhere between 8 and 20; there is no fixed length. How can I do this with a regular expression in R?

x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

Desired output.

x1 = c('XY_ABCD101', 'XZ_ACC122', 'XT_AAEEE100', 'XKY_BBAAUUU124')

Upvotes: 2

Answers (6)

The fourth bird

Reputation: 163352

You might use a pattern that asserts 9-21 word characters to the right (the count includes the underscore, which \w also matches), followed by an underscore, and then matches the first two parts joined by a single underscore:

^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+

Explanation

^ Start of string
(?= Positive lookahead, assert what is to the right of the current location is
- \\w{9,21}_[A-Z0-9] Match 9-21 word chars followed by an underscore and a char A-Z or a digit
) Close the lookahead
[A-Z]+ Match 1+ chars A-Z
_ Match the first underscore
[A-Z0-9]+ Match 1+ chars A-Z or a digit

x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')
regmatches(x, regexpr("^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+", x, perl = TRUE))

Output

[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"
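
One caveat: regmatches() silently drops elements that do not match, so the result can be shorter than the input. Below is a minimal sketch that keeps the output aligned with the input. It assumes a made-up second element ('AB_C1_9') that is too short to satisfy the lookahead; x2, m and out are illustrative names, not part of the question.

```r
# 'AB_C1_9' is a hypothetical input that fails the 9-21 character lookahead
x2 <- c('XY_ABCD101_12_ACE', 'AB_C1_9')
pattern <- "^(?=\\w{9,21}_[A-Z0-9])[A-Z]+_[A-Z0-9]+"

m <- regexpr(pattern, x2, perl = TRUE)   # -1 where there is no match
out <- rep(NA_character_, length(x2))    # start with NA for every element
out[m != -1] <- regmatches(x2, m)        # fill in only the matched positions
out
```

Elements without a match stay NA, so out can be stored next to x2 in a data frame column.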

Upvotes: 2

ThomasIsCoding

Reputation: 101373

Since your intended output always ends at the first digit that is immediately followed by _, you can use the pattern (?<=\\d)(?=_) to find that position and remove the characters that follow:

> gsub("(?<=\\d)(?=_).*$","",x,perl = TRUE)
[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"

Upvotes: 1

akrun

Reputation: 887118

An option with str_remove

library(stringr)
str_remove(x, "_\\d+.*")
[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"
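
If you would rather avoid the stringr dependency, the same pattern should work with base R's sub(), which also replaces only the first match. This is a sketch of an equivalent, not something taken from the stringr documentation; note that the pattern relies on the third field starting with a digit, as it does in the sample data. The name res is illustrative.

```r
x <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN',
       'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

# Remove the first underscore that is followed by digits, plus everything after it
res <- sub("_\\d+.*", "", x)
res
```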

Upvotes: 2

Chemist learns to code

Reputation: 487

library(stringr)
str_extract(x, "[:alnum:]+_[:alnum:]+(?=_)")

[1] "XY_ABCD101"     "XZ_ACC122"     
[3] "XT_AAEEE100"    "XKY_BBAAUUU124"

Upvotes: 3

TarJae

Reputation: 78927

We could use str_extract from the stringr package with a regex that matches everything up to, but not including, the second underscore:

library(stringr)
str_extract(x, "[^_]*_[^_]*")

[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"
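
Because the rule here is simply "keep the first two underscore-separated fields", a regex-free variant is also possible. The sketch below assumes every element contains at least one underscore; parts and res are illustrative names.

```r
x <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN',
       'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

# Split each string on literal underscores (fixed = TRUE disables regex matching)
parts <- strsplit(x, "_", fixed = TRUE)

# Keep the first two fields of each element and glue them back together
res <- vapply(parts, function(p) paste(p[1:2], collapse = "_"), character(1))
res
```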

Upvotes: 4

Mohamed Desouky

Reputation: 4425

Try this

regmatches(x , regexpr("\\D+_\\D+\\d+" , x))

Output

[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"   
[4] "XKY_BBAAUUU124"

Upvotes: 2
