histelheim

Extracting capturing groups from a regex

This regex: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])* matches my input in two parts. The point of the regex is that every match is a pair: the first part, (.*?), has to be followed by the second part, a run of I[0-9] tokens that contains I3.

How can I extract each of these two groups?

library(stringr)
data <- c("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7")
str_extract_all(data, "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*")

Gives me:

[[1]]
[1] "A-B-C-I1-I2-D-E-F-I1-I3"          "-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"

However, I would want something along the lines of:

[[1]]
[1] "A-B-C-I1-I2-D-E-F" [2] "I1-I3"
[[2]]
[1] "D-D-D-D" [2] "I1-I1-I2-I1-I1-I3-I3-I7"

The key here is that the regex matches twice, and each match contains 2 groups. I want every match to have a list of its own, and that list to contain 2 elements, one for each group.

Answers (1)

hwnd

You need to wrap a capturing group around the second part of your expression; as written, only (.*?) is captured. If you're using stringr for this task, I would use str_match_all instead of str_extract_all, because it returns the captured groups as well as the full match:

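library(stringr)

data <- c('A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7')
# The leading -? and the literal - keep the separators out of both groups.
# Column 1 of the result is the full match; columns 2 and 3 are the two groups.
# drop = FALSE keeps a matrix even when there is only one match.
mat <- str_match_all(data, '-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)')[[1]][, 2:3, drop = FALSE]
colnames(mat) <- c('Group 1', 'Group 2')
mat

#      Group 1             Group 2                  
# [1,] "A-B-C-I1-I2-D-E-F" "I1-I3"                  
# [2,] "D-D-D-D"           "I1-I1-I2-I1-I1-I3-I3-I7"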
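
If you really want one list element per match, as in the question, rather than a matrix, the rows can be split afterwards. This is only a minimal sketch under the same assumptions as above (same input, same pattern); the names pattern, groups and per_match are just illustrative and not part of stringr.

```r
library(stringr)

data <- c('A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7')
pattern <- '-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)'

# Keep only the two capture-group columns; drop = FALSE preserves the
# matrix shape even if the pattern matches just once.
groups <- str_match_all(data, pattern)[[1]][, 2:3, drop = FALSE]

# Turn each row into its own character vector of length 2,
# giving one list element per match.
per_match <- lapply(seq_len(nrow(groups)), function(i) unname(groups[i, ]))

per_match
# [[1]]
# [1] "A-B-C-I1-I2-D-E-F" "I1-I3"
#
# [[2]]
# [1] "D-D-D-D"                 "I1-I1-I2-I1-I1-I3-I3-I7"
```

Whether the matrix or the list is more convenient probably depends on what you do next; the matrix is easier to index by column, the list is closer to the shape asked for.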
