Reputation: 23
I have data that resembles the following structure. I need to extract the data that is between the third occurrence of "May 2016" and "Jun 2016".
I have the following pattern, which (to be frank) is not properly constructed and doesn't return the characters I want.
(.*(?>May 2016)){3}(.*(?=Jun 2016)){3}/s
I am new to regex. Can someone help me with the correct expression, please?
May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016
dffdg def efef
Jun 2016
May 2016
Jun 2016
Upvotes: 1
Views: 77
Reputation: 51330
Here you go (this requires perl = TRUE):
(?s)(?:.*?May 2016){3}\K.*?(?=Jun 2016)
Explanation:
(?s): activate the singleline option, so that . also matches newlines
(?:.*?May 2016){3}: match May 2016 3 times with arbitrary text in between
\K: discard what you've matched so far from the match value
.*?: match anything...
(?=Jun 2016): ... up to the first occurrence of Jun 2016
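For reference, one way to apply this pattern in R might look like the sketch below. Using regexpr() together with regmatches(), and trimming the result with trimws(), are my own assumptions rather than part of the answer above; the sample text is the one from the question.

```r
# Sample text from the question, kept as a single multi-line string.
txt <- "May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016
dffdg def efef
Jun 2016
May 2016
Jun 2016"

# Backslashes are doubled because this is an R string literal.
pattern <- "(?s)(?:.*?May 2016){3}\\K.*?(?=Jun 2016)"

# regexpr() locates the first match; perl = TRUE is needed for (?s) and \K.
m <- regexpr(pattern, txt, perl = TRUE)
result <- regmatches(txt, m)

# The newlines around the wanted text are part of the match, so trim them.
result <- trimws(result)
print(result)
```

If the text contains fewer than three occurrences of "May 2016", regmatches() returns an empty character vector, which is worth checking for before using the result.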
Upvotes: 1
Reputation: 20811
A couple of ways:
tt <- readLines(textConnection("May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016
dffdg def efef
Jun 2016
May 2016
Jun 2016"))
(tt <- paste0(tt, collapse = ''))
# [1] "May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016dffdg def efefJun 2016May 2016Jun 2016"
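# Lazily match each "May 2016 ... Jun 2016" pair and record the capture group
m <- gregexpr('May 2016(.*?)Jun 2016', tt, perl = TRUE)
# Extract every captured substring, then keep the third one
mapply(function(x, y) substr(tt, x, x + y - 1),
       attr(m[[1]], 'capture.start'), attr(m[[1]], 'capture.length'))[3]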
# [1] "dffdg def efef"
gsub('May.*May.*May 2016(.*?)Jun 2016.*', '\\1', tt)
# [1] "dffdg def efef"
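If the position of the pair should be configurable, the gregexpr() approach could be wrapped in a small helper, roughly as sketched here. The name nth_between and its arguments are hypothetical, and the sketch assumes that start and end contain no regex metacharacters and that the text has already been collapsed to a single line, as above.

```r
# Hypothetical helper: text captured by the n-th lazy "start ... end" match.
# Assumes start and end are plain text without regex metacharacters.
nth_between <- function(text, start, end, n) {
  pattern <- paste0(start, "(.*?)", end)
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  # No match at all, or fewer than n matches: nothing to return.
  if (m[1] == -1 || length(m) < n) return(NA_character_)
  # capture.start / capture.length have one row per match.
  s <- attr(m, "capture.start")[n, 1]
  l <- attr(m, "capture.length")[n, 1]
  substr(text, s, s + l - 1)
}

tt <- "May 2016 ef Jun 2016 efef May 2016 Jun 2016 May 2016dffdg def efefJun 2016May 2016Jun 2016"
third <- nth_between(tt, "May 2016", "Jun 2016", 3)
print(third)
```

Note that this counts "May 2016 ... Jun 2016" pairs rather than occurrences of "May 2016", so it only answers the question when the two markers alternate.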
Upvotes: 1
Reputation: 48191
If one may assume that "May 2016" and "Jun 2016" alternate and that the former comes first, then:
x <- "May 2016 A Jun 2016 B May 2016 Jun 2016 May 2016 C Jun 2016 May 2016 Jun 2016"
sub("(.*?May 2016.*?Jun 2016){2}.*?May 2016(.*?)Jun 2016.*", "\\2", x)
# [1] " C "
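To see why the alternation assumption matters, here is a small sketch with a hypothetical input y (not from the question) in which "May 2016" appears three times before the first "Jun 2016". The pattern finds nothing to match, so sub() hands back the input unchanged.

```r
pattern <- "(.*?May 2016.*?Jun 2016){2}.*?May 2016(.*?)Jun 2016.*"

# Hypothetical input that breaks the assumption: three "May 2016"
# markers occur before the first "Jun 2016".
y <- "May 2016 A May 2016 B May 2016 C Jun 2016"

# The pattern needs three "Jun 2016" markers, but y has only one,
# so there is no match and sub() returns y as-is.
out <- sub(pattern, "\\2", y)
unchanged <- identical(out, y)
print(unchanged)
```

A check like identical(out, y) can serve as a guard against silently treating the whole input as the extracted value.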
Upvotes: 0