Laurence_jj

Reputation: 726

Select second word in between known phrases - R regex

I would like to extract the text between two known phrases, excluding the first word after the opening phrase, using R and regex. The format is as follows:

"known phrase + unknown_word + target phrase + known_word + bla bla"

for example:

Tesco Plc sells coffee beans today in stores over the uk

Known phrase = "Tesco Plc"
Unknown word = "sells"
Target phrase = "coffee beans"
Known word = "today"
bla bla (unrelated text) = "in stores over the uk"

Initial Attempt

text = "Tesco Plc sells coffee beans today in stores over the uk"
known_phrase = "Tesco Plc"
known_word = "today"

# code
library(stringr)
str_extract(text, paste0("(?<=", known_phrase, ").*(?=", known_word, ")"))

This selects both the unknown word and the target phrase, but I want only the target phrase.

Upvotes: 1

Views: 152

Answers (1)

Wiktor Stribiżew

Reputation: 626851

You can use str_match with a capturing group and take the second column of the result:

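x <- "Tesco Plc sells coffee beans today in stores over the uk"
# Column 1 is the whole match; column 2 is capture group 1 (the target phrase)
stringr::str_match(x, "Tesco\\s+Plc\\s+\\w+\\s+(.*?)\\s+today")[,2]
## OR
Known_phrase = "Tesco Plc"
known_word = "today"
stringr::str_match(x, paste0(Known_phrase, "\\s+\\w+\\s+(.*?)\\s+", known_word))[,2]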

Since your variables are dynamic, you might need to escape any regex metacharacters in them before building the pattern:

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
Known_phrase = "Tesco Plc"
known_word = "today"
stringr::str_match(x, paste0(regex.escape(Known_phrase), "\\s+\\w+\\s+(.*?)\\s+", regex.escape(known_word)))[,2]
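
If you would rather not depend on stringr, the same capture-group idea can be sketched in base R with regexec and regmatches. This is only an illustration: the wrapper name extract_target is hypothetical, and it assumes exactly one unknown word follows the known phrase, as in the question.

```r
# Escape regex metacharacters (same helper as in the answer above)
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical helper: returns the text between
# "<known_phrase> <one word>" and "<known_word>", or NA if there is no match
extract_target <- function(text, known_phrase, known_word) {
  pattern <- paste0(regex.escape(known_phrase),
                    "\\s+\\w+\\s+(.*?)\\s+",
                    regex.escape(known_word))
  matches <- regmatches(text, regexec(pattern, text, perl = TRUE))
  # Element 1 is the full match, element 2 is capture group 1
  vapply(matches, function(m) m[2], character(1))
}

text <- "Tesco Plc sells coffee beans today in stores over the uk"
extract_target(text, "Tesco Plc", "today")
# [1] "coffee beans"
```

The function is vectorised over text, so it can be applied to a whole column; inputs that do not match come back as NA.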

Upvotes: 1
