Reputation: 1192
I need to extract n words that appear before and after a term for a text analysis that I'm working on. Below is a reproducible example:
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game was not completed timely, but it was completed after one hour.")
Below is the call that I'm using, but it does not work when there is punctuation around a word or when words are separated by double spaces.
gsub(".*(( \\w{1,}){3} game( \\w{1,}){3}).*", "\\1", a, perl = TRUE)
[1] " came for our game we were ready"
[2] "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes."
[3] "The day was nice and dry, when she came, we were not here. Our game was not completed timely, but it was completed after one hour."
Below is the desired output:
[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game was not completed"
Upvotes: 4
Views: 1145
Reputation: 9705
Instead of using a space, try \\W{1,}:
gsub(".*(((\\W{1,})\\w{1,}){3} game((\\W{1,})\\w{1,}){3}).*", "\\1", a, perl = TRUE)
[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game was not completed"
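Since the question asks for n words rather than exactly three, the same \\W{1,}\\w{1,} idea can be wrapped in a small helper. This is only a sketch: the name extract_context and its arguments are hypothetical, it assumes term contains no regex metacharacters, and unlike the gsub call it returns NA (instead of the whole string) when there is no match. It also uses \\W+ before the term, so double spaces or punctuation directly in front of it are tolerated.

```r
# Hypothetical helper (not part of the original answer): extract `n` words
# on each side of `term`, allowing any non-word characters between words.
# Assumes `term` is a plain word with no regex metacharacters.
extract_context <- function(x, term, n = 3) {
  # One "word unit": a run of non-word characters followed by a word
  word <- "(?:\\W+\\w+)"
  pattern <- sprintf("%s{%d}\\W+%s%s{%d}", word, n, term, word, n)
  m <- regexpr(pattern, x, perl = TRUE)
  # Keep NA for elements without a match (or NA input)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

x <- c("she came for our game, but we were not ready",
       "we played  the   game  and then left")
extract_context(x, "game", n = 2)
# [1] " for our game, but we"        " played  the   game  and then"
```

If the term may contain characters such as "." or "+", it would need to be escaped before being pasted into the pattern.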
Upvotes: 2
Reputation: 5215
Here's another approach with str_extract from the stringr package:
library(stringr)
str_extract(a, "(( \\S+){3} game[[:punct:]\\s]*( \\S+){3})")
# [1] " came for our game we were ready"
# " came for our game, but we were"
# " not here. Our game was not completed"
Upvotes: 0