Reputation: 1462
I want to generate n-gram keywords from a character vector, given a specific string.
For example, say I need bigrams: for each element of the vector, I want to extract every bigram that contains the keyword ncd and concatenate them.
have <- c('add the ncd mse to the website', 'setup new ncd staffs on t&ta for wireless. all new ncd should go to horizon', 'map out current ncd post locations on 1st, 2nd floors')
want_bigram <- c('the ncd | ncd mse', 'new ncd | ncd staffs | new ncd | ncd should', 'current ncd | ncd post')
Thank you
Upvotes: 0
Views: 221
Reputation: 388982
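Using str_extract_all you can extract the word before 'ncd' and the word after 'ncd' as two separate sets of matches, then combine them with mapply. Note that within each element, all the "word + ncd" matches come first, followed by all the "ncd + word" matches.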
library(stringr)
# x holds the "<word> ncd" matches and y the "ncd <word>" matches for one element
mapply(function(x, y) paste(c(x, y), collapse = ' | '),
       str_extract_all(have, '(\\w+ ncd)'), str_extract_all(have, '(ncd \\w+)'))
#[1] "the ncd | ncd mse"
#[2] "new ncd | new ncd | ncd staffs | ncd should"
#[3] "current ncd | ncd post"
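If the bigrams need to appear in the order they occur in the text (as in want_bigram), or if n should be configurable, a base R sketch along the following lines may help. The helper name keyword_ngrams is hypothetical and not part of any package. It assumes that words are separated by whitespace and that the keyword appears as a standalone word without attached punctuation.

```r
have <- c('add the ncd mse to the website',
          'setup new ncd staffs on t&ta for wireless. all new ncd should go to horizon',
          'map out current ncd post locations on 1st, 2nd floors')

# Hypothetical helper: n-grams containing `keyword`, in order of appearance.
keyword_ngrams <- function(x, keyword, n = 2) {
  vapply(x, function(s) {
    # Split on whitespace; punctuation stays attached to words (an assumption).
    words <- strsplit(s, "\\s+")[[1]]
    if (length(words) < n) return("")
    # Starting position of every run of n consecutive words.
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1))
    # Keep only the n-grams in which the keyword is one of the words.
    has_kw <- vapply(starts, function(i) {
      keyword %in% words[i:(i + n - 1)]
    }, logical(1))
    paste(grams[has_kw], collapse = " | ")
  }, character(1), USE.NAMES = FALSE)
}

keyword_ngrams(have, "ncd")
#[1] "the ncd | ncd mse"
#[2] "new ncd | ncd staffs | new ncd | ncd should"
#[3] "current ncd | ncd post"
```

Elements with no matching n-gram return an empty string, so the result always has the same length as the input. Passing n = 3 would give trigrams instead.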
Upvotes: 1
Reputation: 1137
You can do this with the tidytext package, together with dplyr and stringr.
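library(dplyr)     # %>%, as_tibble, mutate, filter, group_by, summarize, pull
library(stringr)   # str_detect
library(tidytext)  # unnest_tokens
want <- have %>%
  as_tibble() %>%                       # character vector becomes a column named `value`
  mutate(row = row_number()) %>%        # remember which element each bigram came from
  unnest_tokens(bigrams, value, token = "ngrams", n = 2) %>%
  filter(str_detect(bigrams, "ncd")) %>%
  group_by(row) %>%
  summarize(text = paste0(bigrams, collapse = " | ")) %>%
  pull(text)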
Output:
> want
[1] "the ncd | ncd mse"
[2] "new ncd | ncd staffs | new ncd | ncd should"
[3] "current ncd | ncd post"
Upvotes: 1