regex select multiple groups

Question

I have the following string, from which I want to extract the content between the second pair of colons (ES5206956802492 in the example):

"20160607181026_0000005:0607181026000000501:ES5206956802492:479"

I am using R and specifically the stringr package to manipulate strings. The command I attempted to use is:

str_extract("20160607181026_0000005:0607181026000000501:ES5206956802492:479", ":(.*):")

where the regex pattern is expressed at the end of the command. This produces the following result:

":0607181026000000501:ES5206956802492:"

I know that there is a way of grouping results and back-referencing them, which would allow me to select only the part I am interested in, but I can't figure out the right syntax.

How can I achieve this?

akrun · Accepted Answer

If the field we want is the first one that starts with an uppercase letter right after a :, then we can use a compact regex. Here, we use a regex lookbehind ((?<=:)) to require a preceding :, match one uppercase letter ([A-Z]), and then match one or more characters that are not a : ([^:]+).

str_extract(v1, "(?<=:)[A-Z][^:]+")
#[1] "ES5206956802492"
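
The shortcut above relies on the target field being the first one that begins with an uppercase letter. As a small sketch of what happens when that assumption does not hold (the vector x and its two strings are made up for illustration and are not from the question):

```r
library(stringr)

# Hypothetical inputs, invented for illustration only.
x <- c("id:AB12:ES52:479",   # an earlier field also starts with an uppercase letter
       "id:0607:1234:479")   # no field after a colon starts with an uppercase letter

# str_extract() returns the first match per string, or NA when nothing matches.
res <- str_extract(x, "(?<=:)[A-Z][^:]+")
res
```

The first string yields its second field rather than its third, and the second string yields NA, so a position-based pattern is likely the safer choice when the field contents vary.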

Or, if the field is identified by position, i.e. it follows the 2nd :, a base R option is sub. The pattern matches zero or more non-: characters ([^:]*) followed by the first :, then zero or more non-: characters followed by the second :, then captures the following run of non-: characters in a group ((...)), and finally matches the rest of the string (.*). The replacement is a backreference to the first capture group, \1, which has to be written as "\\1" inside an R string.

sub("[^:]*:[^:]*:([^:]+).*", "\\1", v1)
#[1] "ES5206956802492"

Or the repeated part can be grouped and quantified to make the pattern more compact; the field we want is then the second capture group.

sub("([^:]*:){2}([^:]+).*", "\\2", v1)
#[1] "ES5206956802492"
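
Since the question asks about capture groups in stringr specifically, one possible sketch uses str_match(), which returns a character matrix with the full match in the first column and each capture group in the following columns. The anchored pattern below is a variant of the regexes above chosen for illustration, not part of the original answer:

```r
library(stringr)

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

# (?:[^:]*:){2} skips the first two colon-terminated fields without capturing them;
# ([^:]+) captures the third field.
m <- str_match(v1, "^(?:[^:]*:){2}([^:]+)")

# Column 1 is the whole match, column 2 is the first capture group.
third <- m[, 2]
third
```

Using a non-capturing group for the repeated part keeps the field of interest as group 1, so there is no need to count groups as in the "\\2" version.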

Or with strsplit, we split at delimiter : and extract the 3rd element.

strsplit(v1, ":")[[1]][3]
#[1] "ES5206956802492"
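
The strsplit approach extends to a vector of strings, because strsplit() returns a list with one element per input. A minimal base R sketch, assuming a vector x whose second string is a made-up example:

```r
x <- c("20160607181026_0000005:0607181026000000501:ES5206956802492:479",
       "a:b")  # hypothetical input with fewer than three fields

# fixed = TRUE treats ":" as a literal string rather than a regex.
parts <- strsplit(x, ":", fixed = TRUE)

# Take the 3rd piece of each split; indexing past the end yields NA.
third <- vapply(parts, function(p) p[3], character(1))
third
```

Strings with fewer than three fields come back as NA instead of raising an error, which may or may not be the behaviour you want.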

data

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"
