Ahmet Mera

Reputation: 61

RegEx syntax for selecting second occurrence of characters

I have a relatively simple problem but can't figure out the right syntax in RegEx. I have multiple experiment names as strings in various formats, e.g. SEF001DT45 or BV004MF.

What I want to do is select the second occurrence of two letters, i.e. the two letters that follow a numeric value (DT and MF in this case).

I figured out that [A-Z]{2} solves my problem only halfway. How do I get the proper substrings?
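
To illustrate the halfway result, here is a minimal sketch. It assumes the pattern is applied in R with base regmatches/regexpr; the variable names s and first_pairs are only for illustration. On its own, the pattern returns the first pair of letters in each name rather than the pair after the digits:

```r
# Example experiment names from the question
s <- c("SEF001DT45", "BV004MF")

# regexpr() reports only the first match per string,
# so this yields the leading letter pair of each name
first_pairs <- regmatches(s, regexpr("[A-Z]{2}", s))
first_pairs
```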

Upvotes: 5

Views: 119

Answers (5)

GKi

Reputation: 39657

Maybe:

s <- c("SEF001DT45", "BV004MF")
sub("[A-Z]+\\d+([A-Z]{2}).*", "\\1", s)
#sub("[A-Z]+[0-9]+([A-Z]{2}).*", "\\1", s) #Alternative
#[1] "DT" "MF"

Here [A-Z]+ matches the leading letters, \\d+ the digits, [A-Z]{2} the two letters, and .* the rest of the string.
The part wrapped in () is captured and inserted with \\1.
Or something more strict about the second occurrence of two letters:

sub(".*?[A-Z]{2}[0-9]+([A-Z]{2}).*", "\\1", s)
#[1] "DT" "MF"

If only the two characters after the first number should be extracted, this is enough:

regmatches(s, regexpr("(?<=\\d)[A-Z]{2}", s, perl=TRUE))
#[1] "DT" "MF"
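
One thing that may matter in practice: regmatches drops elements that do not match, so the result can be shorter than the input. A minimal sketch of one way to keep the positions aligned, assuming NA is an acceptable placeholder (the third input string and the names m, matched and out are made up for illustration):

```r
# Third element is a made-up name without any digits
s <- c("SEF001DT45", "BV004MF", "NOCODE")

# regexpr() returns -1 for elements with no match
m <- regexpr("(?<=\\d)[A-Z]{2}", s, perl = TRUE)
matched <- as.vector(m) != -1

# Pre-fill with NA, then write the matches into the positions that matched
out <- rep(NA_character_, length(s))
out[matched] <- regmatches(s, m)
out
```

The sub() variants behave differently: when the pattern does not match, sub() returns the input string unchanged instead of dropping it.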

Upvotes: 3

ThomasIsCoding

Reputation: 101189

Another base R trick is strsplit

> sapply(strsplit(s, split = "\\d+"), `[[`, 2)
[1] "DT" "MF"
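
To see why the second piece is the one wanted, it may help to look at the intermediate list. This is only a sketch of the same idea, using vapply and p[2] (my substitution, not the answer's code) so that a string with fewer than two pieces gives NA instead of an error:

```r
s <- c("SEF001DT45", "BV004MF")

# Splitting on runs of digits leaves only the letter chunks:
# parts[[1]] is c("SEF", "DT"), parts[[2]] is c("BV", "MF")
parts <- strsplit(s, split = "\\d+")

# Take the second chunk of each element; p[2] is NA if there is none
second_chunk <- vapply(parts, function(p) p[2], character(1))
second_chunk
```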

or gsub

> gsub("^.*?(?<=\\d)(\\D+).*", "\\1", s, perl = TRUE)
[1] "DT" "MF"

Upvotes: 2

Wiktor Stribiżew

Reputation: 626738

TLDR: Generally, you can get the second occurrence of a PATTERN using one of the following:

sub('.*?PATTERN.*?(PATTERN).*', '\\1', x)
stringr::str_match(x, 'PATTERN.*?(PATTERN)')[,2]
regmatches(x, regexpr('PATTERN.*?\\KPATTERN', x, perl=TRUE))

Details

You can use

x <- c('SEF001DT45','BV004MF')
sub('.*?[A-Z]{2}.*?([A-Z]{2}).*', '\\1', x)
## => [1] "DT" "MF"

See the R demo online and the regex demo. The point here is to match up to the second occurrence of the pattern, capture it, and then match the rest, and replace with the backreference to the capturing group value.

Note that sub performs a single search-and-replace operation, which is fine here since the regex matches the whole string.

Details:

  • .*? - any zero or more chars as few as possible
  • [A-Z]{2} - two uppercase ASCII letters
  • .*? - any zero or more chars as few as possible
  • ([A-Z]{2}) - Group 1 (\1 refers to this group value): two uppercase ASCII letters
  • .* - any zero or more chars as many as possible.
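
The breakdown above can be wrapped into a small reusable function. This is a sketch only: second_occurrence is a hypothetical helper name, not part of base R or stringr, and it assumes the supplied pattern contains no capturing groups of its own (otherwise \\1 would point at the wrong group).

```r
# Hypothetical helper: builds ".*?PATTERN.*?(PATTERN).*" from a sub-pattern
second_occurrence <- function(x, pattern) {
  # First copy is non-capturing, second copy is capture group 1
  rx <- paste0(".*?(?:", pattern, ").*?(", pattern, ").*")
  hit <- grepl(rx, x, perl = TRUE)
  # sub() leaves non-matching strings unchanged, so mark those as NA instead
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(rx, "\\1", x[hit], perl = TRUE)
  out
}

# "X1" is a made-up input with no second letter pair
res <- second_occurrence(c("SEF001DT45", "BV004MF", "X1"), "[A-Z]{2}")
res
```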

You can achieve this with a simpler regex using stringr::str_match:

x <- c('SEF001DT45','BV004MF')
library(stringr)
results <- stringr::str_match(x, '[A-Z]{2}.*?([A-Z]{2})')
results[,2] ## Get Group 1 values

See this R demo.
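
If stringr is not available, a similar capture-group extraction can be sketched in base R with regmatches/regexec, which returns the full match followed by the group values for each string. The variable names groups and second_pair are illustrative only:

```r
x <- c("SEF001DT45", "BV004MF")

# Each list element holds the whole match first, then Group 1
groups <- regmatches(x, regexec("[A-Z]{2}.*?([A-Z]{2})", x, perl = TRUE))

# Pick Group 1 (element 2); g[2] is NA when a string did not match
second_pair <- vapply(groups, function(g) g[2], character(1))
second_pair
```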

Or, with regmatches/regexpr in base R:

x <- c('SEF001DT45','BV004MF')
results <- regmatches(x, regexpr('[A-Z]{2}.*?\\K[A-Z]{2}', x, perl=TRUE))
results

See this R demo.

Here, [A-Z]{2}.*?\\K[A-Z]{2} first matches two uppercase ASCII letters, then any zero or more chars (other than line break chars, since the PCRE engine is used) as few as possible. \K then discards the text matched so far, so only the final [A-Z]{2}, the second occurrence of the two-letter chunk, ends up in the returned match. regexpr only finds the first match.
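
A quick way to see what \K contributes is to run the same pattern with and without it. This is a small sketch; the names without_k and with_k are illustrative:

```r
x <- c("SEF001DT45", "BV004MF")

# Without \K the match spans from the first letter pair to the second
without_k <- regmatches(x, regexpr("[A-Z]{2}.*?[A-Z]{2}", x, perl = TRUE))

# With \K everything matched before it is dropped from the result
with_k <- regmatches(x, regexpr("[A-Z]{2}.*?\\K[A-Z]{2}", x, perl = TRUE))

without_k
with_k
```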

Upvotes: 3

hello_friend

Reputation: 5788

Base R:

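# Input data (defined first so the call below runs as written):
x <- c(
  'SEF001DT45',
  'BV004MF'
)

# Using capture groups: match up to two digits, then capture
# the next two word characters and replace the string with them
gsub(
  ".*\\d{2}(\\w{2}).*",
  "\\1",
  x
)
# [1] "DT" "MF"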

Upvotes: 3

PaulS

Reputation: 25323

A possible solution, based on stringr::str_extract and lookaround:

library(stringr)

strings <- c("SEF001DT45", "BV004MF")

str_extract(strings, "(?<=\\d)[:upper:]{2}")

#> [1] "DT" "MF"

Upvotes: 3
