SteveS

Reputation: 4040

R Regex capture group?

I have a lot of strings like this:

2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0

I want to extract the substring that lies right after the last "/" and ends just before the "_":

556662

So far I have only managed to extract: /01/01/07/556662

by using the following regex: (\/)(.*?)(?=\_)

Please advise how I can capture the right group.

Upvotes: 3

Views: 7052

Answers (3)

The fourth bird

Reputation: 163362

You could use a capturing group:

/([^_/]+)_[^/\s]*

Explanation

  • / Match a literal forward slash
  • ([^_/]+) Capture in group 1 one or more characters other than an underscore or a forward slash
  • _[^/\s]* Match _ followed by zero or more characters other than a forward slash or a whitespace character


One way to get the value of the capturing group is to take the second column of the matrix returned by str_match:

library(stringr)
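# named x rather than str to avoid masking base R's str() function
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
# column 1 holds the whole match, column 2 holds capture group 1
str_match(x, "/([^_/]+)_[^/\\s]*")[,2]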

# [1] "556662"
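
As a rough base R counterpart to the str_match call above, the same capture group can be read with regexec and regmatches. This is only a sketch: the variable names m and id are illustrative, and perl = TRUE is assumed so that \s inside the bracket expression is treated as whitespace.

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# regexec() records the positions of the whole match and of each capture group;
# regmatches() turns them into a list with one character vector per input string
m <- regmatches(x, regexec("/([^_/]+)_[^/\\s]*", x, perl = TRUE))

# element 1 is the whole match, element 2 is capture group 1
id <- m[[1]][2]
print(id)
```

Here m[[1]][1] is the full match (starting at the last "/"), which is why the group value sits at index 2.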

Upvotes: 5

Joseph

Reputation: 90

I adapted the regex from Wiktor Stribiżew's answer. The regmatches call returns the whole match, while sub returns only the captured digits.

x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)

Output

[1] "2019/01/01/07/556662"

[1] "556662"


Upvotes: 0

Wiktor Stribiżew

Reputation: 626870

You may use

x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"


Here, regexpr finds the first match of the pattern and regmatches returns it. The pattern consists of:

  • .*/ - any 0+ chars, as many as possible, up to and including the last /
  • \K - discards the text matched so far from the match value
  • [^_]+ - puts 1 or more chars other than _ into the match value.

Or, a sub solution:

sub(".*/([^_]+).*", "\\1", x)


This is similar to the previous pattern, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern), and the trailing .* makes sure the whole input is matched (and consumed, ready to be replaced).
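
One thing to keep in mind with the sub approach is that sub returns the input unchanged when the pattern does not match at all. A minimal sketch of guarding against that on a vector of strings follows; the second input string and the names pattern and ids are made up for illustration.

```r
x <- c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0",
       "no-separator-here")   # hypothetical input without any "/"

pattern <- ".*/([^_]+).*"

# Without a guard, the second element would come back unchanged,
# so grepl() is used to return NA where the pattern does not match
ids <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
print(ids)
```

Whether NA is the right placeholder depends on how the result is used downstream; it is only one possible choice.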

Alternative non-base R solutions

If you can afford or prefer to work with stringi, you may use

library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"

This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.

Or

stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"

This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
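
If adding a stringi dependency is not an option, the same lookbehind idea can be approximated in base R by collecting every match with gregexpr and keeping the last one. This is a sketch under that assumption; all_matches and last_match are illustrative names.

```r
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"

# gregexpr() finds every run of chars other than _ and / that follows a "/";
# the lookbehind (?<=/) requires perl = TRUE
all_matches <- regmatches(x, gregexpr("(?<=/)[^_/]+", x, perl = TRUE))[[1]]

# keep only the final match, mirroring stri_extract_last_regex
last_match <- all_matches[length(all_matches)]
print(last_match)
```

The intermediate vector also contains the earlier path segments, so this does more work than the stringi call on long inputs.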

Upvotes: 5
