absynth21

Reputation: 23

PCRE regex: extract the characters between the nth occurrences of two strings

I have data that resembles the following structure. I need to extract the data that is between the third occurrence of "May 2016" and "Jun 2016".

I have the following pattern, which is not properly constructed and does not return the characters I want:

(.*(?>May 2016)){3}(.*(?=Jun 2016)){3}/s

I am new to regex. Can someone help me with the correct expression, please?

May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016

dffdg def efef

Jun 2016

May 2016

Jun 2016

Upvotes: 1

Views: 77

Answers (3)

Lucas Trzesniewski

Reputation: 51330

Here you go (this requires perl = TRUE):

(?s)(?:.*?May 2016){3}\K.*?(?=Jun 2016)


Explanation:

  • (?s) activates the single-line option, so . also matches newlines
  • (?:.*?May 2016){3} matches May 2016 three times, with arbitrary text in between
  • \K discards everything matched so far from the match value
  • .*? lazily matches anything
  • (?=Jun 2016) ... up to the first occurrence of Jun 2016
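For illustration, here is a minimal sketch of how the pattern might be applied in R. The variable names (`txt`, `pattern`, `between`) are made up for this example, and the sample string is typed in by hand from the question's data, so treat it as an assumption rather than the asker's actual input.

```r
# Sample text reconstructed from the question (assumed layout, with blank lines).
txt <- "May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016\n\ndffdg def efef\n\nJun 2016\n\nMay 2016\n\nJun 2016"

# Backslashes are doubled because this is an R string literal.
pattern <- "(?s)(?:.*?May 2016){3}\\K.*?(?=Jun 2016)"

# perl = TRUE is required: \K and (?s) are PCRE features.
m <- regexpr(pattern, txt, perl = TRUE)

# regmatches() returns the matched text, or character(0) if nothing matched.
between <- regmatches(txt, m)

# The raw match keeps the surrounding newlines; trimws() strips them.
trimws(between)
# [1] "dffdg def efef"
```

Note that the match itself still contains the line breaks on either side of the text; whether to trim them depends on what you need downstream.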

Upvotes: 1

rawr

Reputation: 20811

A couple of ways:

tt <- readLines(textConnection("May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016

dffdg def efef

Jun 2016

May 2016

Jun 2016"))

(tt <- paste0(tt, collapse = ''))
# [1] "May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016dffdg def efefJun 2016May 2016Jun 2016"


# Find every May 2016 ... Jun 2016 pair, then pull out the third captured group
m <- gregexpr('May 2016(.*?)Jun 2016', tt, perl = TRUE)
mapply(function(x, y) substr(tt, x, x + y - 1),
       attr(m[[1]], 'capture.start'), attr(m[[1]], 'capture.length'))[3]
# [1] "dffdg def efef"


gsub('May.*May.*May 2016(.*?)Jun 2016.*', '\\1', tt)
# [1] "dffdg def efef"

Upvotes: 1

Julius Vainora

Reputation: 48191

If one may assume that "May 2016" and "Jun 2016" alternate and that "May 2016" comes first, then

x <- "May 2016 A Jun 2016 B May 2016 Jun 2016 May 2016 C Jun 2016 May 2016 Jun 2016"
sub("(.*?May 2016.*?Jun 2016){2}.*?May 2016(.*?)Jun 2016.*", "\\2", x)
[1] " C "
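
One thing to watch with this approach: `sub()` returns its input unchanged when the pattern does not match, so a string that breaks the alternation assumption silently comes back whole. A rough sketch of a guard follows. The helper name `extract_third` and the `x_bad` string are hypothetical, and `perl = TRUE` is an addition here so that the lazy quantifiers follow PCRE semantics.

```r
# Pattern from the answer above; perl = TRUE is an addition in this sketch.
pattern <- "(.*?May 2016.*?Jun 2016){2}.*?May 2016(.*?)Jun 2016.*"

# Hypothetical helper: return NA instead of the unchanged input on no match.
extract_third <- function(x) {
  if (!grepl(pattern, x, perl = TRUE)) return(NA_character_)
  sub(pattern, "\\2", x, perl = TRUE)
}

x_ok  <- "May 2016 A Jun 2016 B May 2016 Jun 2016 May 2016 C Jun 2016 May 2016 Jun 2016"
x_bad <- "May 2016 A May 2016 B May 2016 C Jun 2016"  # only one "Jun 2016"

extract_third(x_ok)
# [1] " C "
extract_third(x_bad)
# [1] NA
```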

Upvotes: 0
