Reputation: 307
I'm delving further into text mining, and a client recently asked whether it is possible to get up to 5 words preceding and following a key term. For example:
To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.
Key term=twisters
Preceding 5 words=the full effect of tongue
Following 5 words=you should repeat them several
The long-term plan is to take the 10 most frequent terms, along with the preceding and following words, and load them into a data.frame. I've poked around a little with gsub, but to no avail.
Any thoughts, guidance, etc., would be appreciated.
Upvotes: 1
Views: 132
Reputation: 17611
The quanteda package has a function specifically for returning keywords in context: kwic. It uses stringi under the hood.
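library(quanteda)
# txt is the sample character vector defined further down.
# Newer quanteda releases name this argument `pattern` and expect tokens(txt) as input.
kwic(txt, keywords = "twisters", window = 5, case_insensitive = TRUE)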
# contextPre keyword contextPost
#[text1, 8] the full effect of tongue [ twisters ] you should repeat them several
#[text2, 2] The [ twisters ] are always twisting
#[text3, 9] for those guys, they are [ twisters ] of words and will tell
#[text4, 1] [ Twisters ] will ruin your life.
Sample text used above:
txt <- c("To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.",
"The twisters are always twisting",
"watch out for those guys, they are twisters of words and will tell a yarn a mile long",
"Twisters will ruin your life.")
Upvotes: 2
Reputation: 51592
You can use word from stringr:
library(stringr)
# x is the example sentence as a single string; ind is the word position of the key term
ind <- sapply(strsplit(x, ' '), function(i) which(i == 'twisters'))
word(x, ind-5, ind-1)
#[1] "the full effect of tongue"
word(x, ind+1, ind+5)
#[1] "you should repeat them several"
Upvotes: 3
Reputation: 1642
Use strsplit to split the string into a vector of words, and then use grep to get the right indices. If you're doing this a lot, you should wrap it in a function.
x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."
x_split <- strsplit(x, " ")[[1]]
key <- "twisters"
# Search the vector of words, not the original string, so the index is a word position
key_index <- grep(key, x_split)
before <- x_split[(key_index - 5):(key_index - 1)]
after <- x_split[(key_index + 1):(key_index + 5)]
before
#[1] "the" "full" "effect" "of" "tongue"
after
#[1] "you" "should" "repeat" "them" "several"
paste(before, collapse = " ")
#[1] "the full effect of tongue"
paste(after, collapse = " ")
#[1] "you should repeat them several"
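The "wrap it in a function" suggestion could look roughly like the sketch below. It is base R only. The names context_words and top_terms are hypothetical helpers made up for this illustration, not functions from any package. The sketch assumes whitespace tokenization, case-insensitive whole-word matching after stripping surrounding punctuation, and no stop-word removal, so the "most frequent terms" will include words like "as" unless you filter them first.

```r
# Hypothetical helper: up to `window` words on each side of every
# occurrence of `key` in a single string `x`, returned as a data.frame.
context_words <- function(x, key, window = 5) {
  words <- strsplit(x, "\\s+")[[1]]
  # Compare on a cleaned copy (lower case, no leading/trailing punctuation);
  # the returned context keeps the original words untouched.
  clean <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", words))
  hits <- which(clean == tolower(key))
  if (length(hits) == 0) {
    return(data.frame(keyword = character(0), before = character(0),
                      after = character(0), stringsAsFactors = FALSE))
  }
  rows <- lapply(hits, function(i) {
    # Clamp the window so it never runs past either end of the text
    pre <- if (i > 1) words[max(1, i - window):(i - 1)] else character(0)
    post <- if (i < length(words)) {
      words[(i + 1):min(length(words), i + window)]
    } else {
      character(0)
    }
    data.frame(keyword = key,
               before = paste(pre, collapse = " "),
               after = paste(post, collapse = " "),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Hypothetical helper: the n most frequent cleaned words in `x`
top_terms <- function(x, n = 10) {
  words <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "",
                        strsplit(x, "\\s+")[[1]]))
  words <- words[nzchar(words)]
  head(names(sort(table(words), decreasing = TRUE)), n)
}

x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."

res <- context_words(x, "twisters")
res$before
res$after

# One data.frame covering the most frequent terms and their context
all_context <- do.call(rbind, lapply(top_terms(x, 10), function(k) context_words(x, k)))
```

Clamping the indices matters when the key term sits within five words of the start or end of the text: the unclamped version would produce zero or negative subscripts there, which R interprets as "drop these elements" rather than raising an error.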
Upvotes: 0