Afiq Johari

Reputation: 1462

R: return n-grams from a vector that contain a specific string

I want to generate n-gram keywords from a character vector, keeping only those that contain a specific string. For example, with bigrams: for each element of the vector, I want to extract every bigram that contains the keyword ncd and concatenate them into one string.

have <- c('add the ncd mse to the website', 'setup new ncd staffs on t&ta for wireless. all new ncd should go to horizon', 'map out current ncd post locations on 1st, 2nd floors')

want_bigram <- c('the ncd | ncd mse', 'new ncd | ncd staffs | new ncd | ncd should', 'current ncd | ncd post')

Thank you

Upvotes: 0

Views: 221

Answers (2)

Ronak Shah

Reputation: 388982

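Using str_extract_all you can extract the word before 'ncd' and the word after 'ncd' in two separate passes, then combine the matches for each element with mapply. Note that the result lists all "word ncd" bigrams first, followed by all "ncd word" bigrams, so the order differs slightly from want_bigram when the keyword occurs more than once.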

library(stringr)

mapply(function(x, y) paste(c(x, y), collapse = ' | '), 
    str_extract_all(have, '(\\w+ ncd)'), str_extract_all(have, '(ncd \\w+)'))

#[1] "the ncd | ncd mse"                          
#[2] "new ncd | new ncd | ncd staffs | ncd should"
#[3] "current ncd | ncd post" 
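
One caveat with the patterns above: they also match when ncd is only the start of a longer word (for example "ncds"). A minimal base R sketch of the same two-pass idea with word boundaries added is shown below; the use of \\b and of regmatches/gregexpr in place of stringr is my own variation, not part of the original answer.

```r
have <- c('add the ncd mse to the website',
          'setup new ncd staffs on t&ta for wireless. all new ncd should go to horizon',
          'map out current ncd post locations on 1st, 2nd floors')

# Word before the keyword; \\b ensures "ncd" is not the start of a longer word.
before <- regmatches(have, gregexpr("\\w+ ncd\\b", have, perl = TRUE))
# Word after the keyword; \\b ensures "ncd" is not the end of a longer word.
after  <- regmatches(have, gregexpr("\\bncd \\w+", have, perl = TRUE))

# Combine both sets of matches per element, as in the mapply call above.
res_regex <- mapply(function(b, a) paste(c(b, a), collapse = " | "),
                    before, after, USE.NAMES = FALSE)
res_regex
#[1] "the ncd | ncd mse"
#[2] "new ncd | new ncd | ncd staffs | ncd should"
#[3] "current ncd | ncd post"
```

As with the stringr version, the "before" matches come first and the "after" matches second within each element.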

Upvotes: 1

Desmond

Reputation: 1137

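You can do this with the tidytext package. The pipeline also needs dplyr (for %>%, as_tibble, mutate, filter, group_by, summarize and pull) and stringr (for str_detect).

library(dplyr)
library(stringr)
library(tidytext)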

want <- have %>% 
  as_tibble() %>% 
  mutate(row = row_number()) %>% 
  unnest_tokens(bigrams, value, token = "ngrams", n = 2) %>% 
  filter(str_detect(bigrams, "ncd")) %>% 
  group_by(row) %>% 
  summarize(text = paste0(bigrams, collapse = " | ")) %>% 
  pull(text)

Output:

> want  
[1] "the ncd | ncd mse"                          
[2] "new ncd | ncd staffs | new ncd | ncd should"
[3] "current ncd | ncd post"    
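
Be aware that filter() drops every element with no matching bigram, so want can be shorter than have. If the result has to line up with the input, something like the following base R sketch may help. The helper name keyword_ngrams and its tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions of mine that only approximate what unnest_tokens does; it returns "" for elements without a match.

```r
# Hypothetical helper (not part of tidytext): keyword-containing n-grams per element.
keyword_ngrams <- function(x, keyword, n = 2, sep = " | ") {
  keyword <- tolower(keyword)
  vapply(x, function(s) {
    # Assumed tokenization: lowercase, split on runs of non-alphanumeric characters.
    words <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return("")
    # Start index of every run of n consecutive words.
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    # Keep n-grams in which one whole word equals the keyword.
    hits <- vapply(starts, function(i) keyword %in% words[i:(i + n - 1)], logical(1))
    paste(grams[hits], collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

have <- c('add the ncd mse to the website',
          'setup new ncd staffs on t&ta for wireless. all new ncd should go to horizon',
          'map out current ncd post locations on 1st, 2nd floors')

keyword_ngrams(have, "ncd")
#[1] "the ncd | ncd mse"
#[2] "new ncd | ncd staffs | new ncd | ncd should"
#[3] "current ncd | ncd post"
```

Setting n = 3 would give trigrams in the same way, and matching on whole words means a token such as "ncds" is not counted as a hit.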

Upvotes: 1
