Atwp67

Reputation: 307

Extracting words to the left and right of key term

I'm delving further into text mining, and a client recently asked whether it is possible to get up to 5 words preceding and following a key term. For example:

To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.

Key term=twisters
Preceding 5 words=the full effect of tongue
Following 5 words=you should repeat them several

The long-term plan is to take the 10 most frequent terms, along with the preceding and following words, and load them into a data.frame. I've poked around a little with gsub, but to no avail.

Any thoughts, guidance, etc., would be appreciated.

Upvotes: 1

Views: 132

Answers (3)

Jota

Reputation: 17611

The quanteda package has a function specifically for returning keywords in context: kwic. It uses stringi under the hood.

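library(quanteda)
# Written against an older quanteda release. Newer releases name this argument
# `pattern` and expect a tokens object, e.g. kwic(tokens(txt), pattern = "twisters", window = 5)
kwic(txt, keywords = "twisters", window = 5, case_insensitive = TRUE)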
#                            contextPre  keyword                      contextPost
#[text1, 8] the full effect of tongue [ twisters ] you should repeat them several
#[text2, 2]                       The [ twisters ] are always twisting           
#[text3, 9]  for those guys, they are [ twisters ] of words and will tell
#[text4, 1]                           [ Twisters ] will ruin your life. 

Sample text (define txt before running the call above):

# sample text
txt <- c("To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.",
         "The twisters are always twisting",
         "watch out for those guys, they are twisters of words and will tell a yarn a mile long",
         "Twisters will ruin your life.")

Upvotes: 2

Sotos

Reputation: 51592

You can use word from stringr, where x is the sentence from the question stored as a character string:

library(stringr)
ind <- sapply(strsplit(x, ' '), function(i) which(i == 'twisters'))
word(x, ind-5, ind-1)
#[1] "the full effect of tongue"
word(x, ind+1, ind+5)
#[1] "you should repeat them several"

Upvotes: 3

Alexey Shiklomanov

Reputation: 1642

Use strsplit to split the string into a vector, and then use grep to get the right indices. If you're doing this a lot, you should wrap it in a function.

x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."
x_split <- strsplit(x, " ")[[1]]
key <- "twisters"
key_index <- grep(key, x_split)  # search the split words, not the whole string
before <- x_split[(key_index - 5):(key_index - 1)]
after <- x_split[(key_index + 1):(key_index + 5)]
before
#[1] "the"    "full"   "effect" "of"     "tongue"
after
#[1] "you"     "should"  "repeat"  "them"    "several"
paste(before, collapse = " ")
#[1] "the full effect of tongue"
paste(after, collapse = " ")
#[1] "you should repeat them several" 
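
Following up on the suggestion to wrap this in a function, here is a minimal base R sketch. The name words_around and its arguments are my own illustrative choices, not from any package. It assumes words are separated by whitespace, matches the key case-insensitively after stripping punctuation attached to a word, and returns one data.frame row per occurrence. Near the start or end of the text it returns fewer than n words rather than failing.

```r
# Hypothetical helper: the name and arguments are illustrative assumptions.
words_around <- function(text, key, n = 5) {
  # Split on runs of whitespace so repeated spaces do not create empty words
  words <- strsplit(text, "\\s+")[[1]]
  # Match case-insensitively, ignoring punctuation attached to a word
  clean <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", words))
  hits <- which(clean == tolower(key))
  if (length(hits) == 0) {
    # No occurrence: return an empty data.frame with the same columns
    return(data.frame(key = character(0), before = character(0),
                      after = character(0), stringsAsFactors = FALSE))
  }
  rows <- lapply(hits, function(i) {
    before <- tail(words[seq_len(i - 1)], n)  # at most n words, fewer near the start
    after <- head(words[-seq_len(i)], n)      # at most n words, fewer near the end
    data.frame(key = key,
               before = paste(before, collapse = " "),
               after = paste(after, collapse = " "),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."
res <- words_around(x, "twisters")
res$before
#[1] "the full effect of tongue"
res$after
#[1] "you should repeat them several"

# A key at the very start of a sentence simply has an empty "before" column
words_around("Twisters will ruin your life.", "twisters")$after
#[1] "will ruin your life."
```

Because each call returns a data.frame, results for several sentences or several key terms can be stacked with do.call(rbind, lapply(...)), which fits the plan of loading everything into one data.frame.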

Upvotes: 0
