Reputation: 61
This is my first time attempting to extract a string using gsub and regular expressions in R. I would like to extract three words after the first occurrence of the word "at" or "around" in each cell of a text column (col in example) and place the extraction into a new column (new_extract).
What I have thus far is the following:
df$new_extract <- gsub(".*at(\\w{1,}){3}).*", "\\1", df$col, perl = TRUE)
Any advice on changes / different approaches welcomed!
Upvotes: 1
Views: 571
Reputation: 626738
Your regex can only match words after the last "at", because the leading .* is greedy. Also, there is no pattern to match the gap between "at" or "around" and the words that follow (you are not trying to match "around" at all, by the way), so your pattern will not extract any words in the end.
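To make both points concrete, here is a small sketch on a made-up string. The sample text and the two simplified patterns are my own assumptions for illustration, not the exact pattern from the question.

```r
# Hypothetical sample string (not from the question)
x <- "meet at noon at the old town hall"

# Greedy .* backtracks from the end of the string, so the LAST "at" is used
greedy <- sub(".*at\\W+(\\w+).*", "\\1", x, perl = TRUE)

# Lazy .*? stops at the FIRST "at"
lazy <- sub(".*?at\\W+(\\w+).*", "\\1", x, perl = TRUE)

# Without anything to match the space after "at", the pattern cannot match,
# and sub() returns its input unchanged
no_gap <- sub(".*at(\\w+){3}.*", "\\1", x, perl = TRUE)

print(greedy)  # "the"
print(lazy)    # "noon"
print(identical(no_gap, x))  # TRUE
```

Note that a failed match is silent: sub and gsub simply hand back the original string.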
I suggest this approach with sub:
sub(".*?\\ba(?:t|round)\\W+(\\w+(?:\\W+\\w+){0,2}).*", "\\1", df$col, perl=TRUE)
See the regex demo.
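For a quick check in R, something like the following should work. The data frame contents are invented for illustration; only the column names col and new_extract come from the question.

```r
# Hypothetical sample data; only the column name `col` is from the question
df <- data.frame(
  col = c(
    "We met at the old bridge yesterday",
    "It happened around five in the evening, at home",
    "Arrived at noon",
    "Nothing relevant here"
  ),
  stringsAsFactors = FALSE
)

pattern <- ".*?\\ba(?:t|round)\\W+(\\w+(?:\\W+\\w+){0,2}).*"

# Replace each whole string with Group 1 (up to three words after the keyword)
df$new_extract <- sub(pattern, "\\1", df$col, perl = TRUE)

print(df$new_extract)
# "the old bridge" "five in the" "noon" "Nothing relevant here"
```

Two things to keep in mind: when fewer than three words follow the keyword you get whatever is there (third row), and when there is no match at all sub returns the input unchanged (fourth row). If you would rather have NA in that case, one option is to test with grepl(pattern, df$col, perl = TRUE) first.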
Here is what the pattern does:
.*? - matches from the start, any zero or more chars other than line break chars, as few as possible
\ba - a word boundary and then a
(?:t|round) - t or round
\W+ - one or more non-word chars
(\w+(?:\W+\w+){0,2}) - Group 1: one or more word chars and then zero, one or two occurrences of one or more non-word chars followed with one or more word chars
.* - any zero or more chars other than line break chars, as many as possible.
Upvotes: 1