user3357059

Reputation: 1192

R gsub: extract n words before and after a term

I need to extract n words that appear before and after a term for a text analysis that I'm working on. Below is a reproducible example:

a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

Below is the call that I'm using, but it does not work when there is punctuation around a word or a double space.

gsub(".*(( \\w{1,}){3} game( \\w{1,}){3}).*", "\\1", a, perl = TRUE)


[1] " came for our game we were ready"                                                                                                  
[2] "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes."                 
[3] "The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour."

Below is the desired output:

[1] " came for our game we were ready"                                                                                                  
[2] " came for our game, but we were"                 
[3] " not here. Our game was not completed"

Upvotes: 4

Views: 1145

Answers (2)

thc

Reputation: 9705

Instead of matching a literal space, try \\W{1,}, which matches one or more non-word characters (spaces as well as punctuation):

gsub(".*(((\\W{1,})\\w{1,}){3} game((\\W{1,})\\w{1,}){3}).*", "\\1", a, perl = TRUE)

[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game  was not completed"
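
To make n and the term adjustable, the same idea can be wrapped in a small function. This is only a sketch: `extract_context` is a hypothetical name, and it assumes `term` is a plain word with no regex metacharacters. It uses `regexpr()` and `regmatches()` rather than `gsub()`, because `gsub()` returns the input unchanged when the pattern does not match (which is what happened in the question), while this version returns NA.

```r
# Hypothetical helper: n words on each side of `term`.
# Assumes `term` is a plain word (no regex metacharacters).
extract_context <- function(x, term, n = 3) {
  # \W+ is any run of non-word characters (spaces, punctuation);
  # \w+ is one word. The term itself must be preceded by \W+.
  pattern <- sprintf("(\\W+\\w+){%d}\\W+%s(\\W+\\w+){%d}", n, term, n)
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

extract_context(a, "game", n = 3)
```

With n = 3 this should give the same three strings as the gsub call above.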

Upvotes: 2

cmaher

Reputation: 5215

Here's another approach with str_extract from the stringr package:

library(stringr)

str_extract(a, "(( \\S+){3} game[[:punct:]\\s]*( \\S+){3})")

# [1] " came for our game we were ready"
# [2] " came for our game, but we were"
# [3] " not here. Our game  was not completed"
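
Note that the third result still contains the double space from the input, while the desired output in the question has a single space. One possible way to get there, sketched here as an assumption about what is wanted, is to collapse repeated whitespace with `str_squish()` from stringr before extracting. `x` is just the third example string, and `str_squish()` also trims leading and trailing whitespace from the input.

```r
library(stringr)

# Third example string from the question (note the double space after "game").
x <- "The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour."

# Collapse runs of whitespace to a single space, then extract as before.
squished <- str_squish(x)
result <- str_extract(squished, "(( \\S+){3} game[[:punct:]\\s]*( \\S+){3})")
result
```

The pattern itself is unchanged; only the input is normalised first.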

Upvotes: 0
