thepule

Reputation: 1751

regex select multiple groups

I have the following string, from which I want to extract the content between the second and third colons (ES5206956802492 in the example):

"20160607181026_0000005:0607181026000000501:ES5206956802492:479"

I am using R and specifically the stringr package to manipulate strings. The command I attempted to use is:

str_extract("20160607181026_0000005:0607181026000000501:ES5206956802492:479", ":(.*):")

where the regex pattern is the last argument. This produces the following result:

":0607181026000000501:ES5206956802492:"

I know there is a way to group results and back-reference them, which would let me select only the part I am interested in, but I can't figure out the right syntax.

How can I achieve this?

Upvotes: 2

Views: 283

Answers (2)

Sotos

Reputation: 51592

You can also use word from stringr, which here returns the third field when the string (v1) is split on ':':

library(stringr)
word(v1, 3, sep=':')
#[1] "ES5206956802492"

Upvotes: 3

akrun

Reputation: 887138

If the field we want is the only one that starts with an uppercase letter, we can use a compact regex. Here, a lookbehind ((?<=:)) asserts that the match is preceded by a :, and we then match an uppercase letter ([A-Z]) followed by one or more characters that are not a : ([^:]+).

str_extract(v1, "(?<=:)[A-Z][^:]+")
#[1] "ES5206956802492"

Or, if the extraction is based on position (i.e. the field after the second :), a base R option is sub. The pattern matches zero or more non-: characters ([^:]*), the first :, zero or more non-: characters again, and the second :. It then captures the following run of non-: characters in a group ((...)) and matches the rest of the string (.*). In the replacement, the backreference \\1 (the first capture group) keeps only the captured part.

sub("[^:]*:[^:]*:([^:]+).*", "\\1", v1)
#[1] "ES5206956802492"

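Or the repeated part can be grouped and quantified with {2} to make the pattern more compact. The field we want is then the second capture group, so the replacement uses \\2: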

sub("([^:]*:){2}([^:]+).*", "\\2", v1)
#[1] "ES5206956802492"
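
Since the question asks about capture groups in stringr specifically, the same idea can be sketched with str_match, which returns the capture groups alongside the full match. This is a minimal sketch rather than part of the original answer; the variable names m and third_field are made up for illustration.

```r
library(stringr)

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

# str_match() returns a character matrix: column 1 holds the full match,
# column 2 holds the first capture group.
# The pattern skips two colon-terminated fields, then captures the third.
m <- str_match(v1, "^[^:]*:[^:]*:([^:]*)")

third_field <- m[, 2]
third_field
#[1] "ES5206956802492"
```

Anchoring with ^ and using [^:]* instead of .* avoids the greedy match that made the original ":(.*):" pattern run all the way to the last colon.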

Or with strsplit, we split at delimiter : and extract the 3rd element.

strsplit(v1, ":")[[1]][3]
#[1] "ES5206956802492"
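
If there are several such strings, the [[1]] above would only look at the first one. A possible sketch for a whole vector, assuming every element uses : as the delimiter (the second element of ids is a made-up example):

```r
ids <- c(
  "20160607181026_0000005:0607181026000000501:ES5206956802492:479",
  "a:b:c:d"  # hypothetical second input, for illustration only
)

# strsplit() returns one character vector per input string;
# fixed = TRUE treats ":" as a literal string rather than a regex.
parts <- strsplit(ids, ":", fixed = TRUE)

# Take the 3rd piece of each; an input with fewer than 3 fields yields NA.
third <- vapply(parts, function(p) p[3], character(1))
third
#[1] "ES5206956802492" "c"
```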

data

v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

Upvotes: 2
