Reputation: 726
I would like to use R and regex to select the text between two known phrases, but exclude the first word that follows the opening phrase. The format is as follows:
"known phrase + unknown_word + target phrase + known_word + bla bla"
For example:
Tesco Plc sells coffee beans today in stores over the uk
Known phrase = "Tesco Plc"
Unknown word = "sells"
Target phrase = "coffee beans"
Known word = "today"
bla bla (unrelated text) = "in stores over the uk"
Initial Attempt
library(stringr)
text = "Tesco Plc sells coffee beans today in stores over the uk"
known_phrase = "Tesco Plc"
known_word = "today"
# code: lookbehind for the known phrase, lookahead for the known word
str_extract(text, paste0("(?<=", known_phrase, ").*(?=", known_word, ")"))
This selects both the unknown_word and the target phrase, but I just want the target phrase.
Upvotes: 1
Views: 152
Reputation: 626851
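You can consume the unknown word with \\w+, capture what follows in a lazy group, and take the second column of the str_match result (the first capture group):
x = "Tesco Plc sells coffee beans today in stores over the uk"
stringr::str_match(x, "Tesco\\s+Plc\\s+\\w+\\s+(.*?)\\s+today")[,2]
## OR, with the known parts held in variables
Known_phrase = "Tesco Plc"
known_word = "today"
stringr::str_match(x, paste0(Known_phrase, "\\s+\\w+\\s+(.*?)\\s+", known_word))[,2]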
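If you would rather not depend on stringr, the same capture-group idea can be sketched in base R. This is only an illustration under the assumption that the input looks like the example sentence; the names pattern, m and target are mine, not from the question.

```r
# Base R sketch of the capture-group approach (no stringr needed).
x <- "Tesco Plc sells coffee beans today in stores over the uk"
known_phrase <- "Tesco Plc"
known_word <- "today"

# known phrase, whitespace, one word to skip, whitespace,
# lazy capture of the target, whitespace, known word
pattern <- paste0(known_phrase, "\\s+\\w+\\s+(.*?)\\s+", known_word)

# regexec() reports the full match plus each capture group;
# regmatches() turns that into a character vector per input string
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 is the whole match, element 2 is the first capture group
target <- m[[1]][2]
print(target)  # "coffee beans"
```

The [,2] in the stringr version and the [2] here play the same role: both pick the first capture group rather than the whole match.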
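Since your variables are dynamic, they may contain regex metacharacters such as ( or ., so you might need an escaping function:
regex.escape <- function(string) {
  # Prefix every regex metacharacter with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}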
Known_phrase = "Tesco Plc"
known_word = "today"
stringr::str_match(x, paste0(regex.escape(Known_phrase), "\\s+\\w+\\s+(.*?)\\s+", regex.escape(known_word)))[,2]
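To see why the escaping matters, here is a small base R sketch with a made-up company name containing parentheses. The phrase "Tesco (UK) Plc" and the helper build_pattern are hypothetical, chosen only to show the difference.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical input: the known phrase contains regex metacharacters
x <- "Tesco (UK) Plc sells coffee beans today in stores over the uk"
known_phrase <- "Tesco (UK) Plc"
known_word <- "today"

# Hypothetical helper that assembles the pattern from the two known parts
build_pattern <- function(phrase, word) {
  paste0(phrase, "\\s+\\w+\\s+(.*?)\\s+", word)
}

# Unescaped: "(UK)" is read as a group, so the literal parentheses never match
unescaped <- regmatches(
  x, regexec(build_pattern(known_phrase, known_word), x, perl = TRUE)
)[[1]]

# Escaped: the parentheses are matched literally
escaped <- regmatches(
  x,
  regexec(build_pattern(regex.escape(known_phrase), regex.escape(known_word)),
          x, perl = TRUE)
)[[1]]

print(length(unescaped))  # no match, so nothing is returned
print(escaped[2])         # the first capture group
```

Without escaping, the pattern silently fails to match rather than raising an error, which is easy to miss when the phrases come from data.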
Upvotes: 1