kfkashfkaugfjbicbc

Reputation: 31

Ngram in R: calculating word frequency and sum of values

I would like to perform the following calculations:

Input:

Column_A                    Column_B
Word_A                      10
Word_A Word_B               20
Word_B Word_A               30
Word_A Word_B Word_C        40

Output:

Column_A1                   Column_B1
Word_A                      100 = 10+20+30+40
Word_B                      90  = 20+30+40
Word_C                      40  = 40
Word_A Word_B               90  = 20+30+40
Word_A Word_C               40  = 40
Word_B Word_C               40  = 40
Word_A Word_B Word_C        40  = 40

The order of the words in the output does not matter, so Word_A Word_B = 90 = Word_B Word_A. Using the RWeka and tm libraries I was able to extract unigrams (single words), but I will need n-grams with n = 1, 2, 3 and to calculate Column_B1.

Upvotes: 1

Views: 120

Answers (1)

alistaire

Reputation: 43354

A tidyverse approach:

library(tidyverse)
library(tokenizers)

# sample data from the question
df <- tibble(
    Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
    Column_B = c(10L, 20L, 30L, 40L)
)

df %>% 
    rowwise() %>% 
    # collect contiguous 1- to 3-grams plus skip bigrams (to catch e.g. "Word_A Word_C")
    mutate(ngram = list(c(tokenize_ngrams(Column_A, lowercase = FALSE, n = 3, n_min = 1), 
                          tokenize_skip_ngrams(Column_A, lowercase = FALSE, n = 2), 
                          recursive = TRUE)), 
           # sort the words within each ngram so word order doesn't matter, then deduplicate
           ngram = list(unique(map_chr(strsplit(ngram, ' '), 
                                       ~paste(sort(.x), collapse = ' '))))) %>% 
    unnest(ngram) %>% 
    count(ngram, wt = Column_B)

## # A tibble: 7 × 2
##                  ngram     n
##                  <chr> <int>
## 1               Word_A   100
## 2        Word_A Word_B    90
## 3 Word_A Word_B Word_C    40
## 4        Word_A Word_C    40
## 5               Word_B    90
## 6        Word_B Word_C    40
## 7               Word_C    40

Note this is currently only robust for strings of up to three words. For longer strings you would have to decide how far you want the skip n-grams to reach, or take a different approach altogether.
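One such different approach, sketched here under the assumption that word order never matters (the helper `all_subsets` is mine, not from the answer above): since each output n-gram is just a non-empty subset of a row's distinct words, base R's combn can enumerate them for strings of any length, with no skip-gram tuning needed.

```r
library(tidyverse)

# sample data from the question
df <- tibble(
    Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
    Column_B = c(10L, 20L, 30L, 40L)
)

# all non-empty subsets of a string's distinct words, each joined in sorted order
all_subsets <- function(s) {
    words <- sort(unique(strsplit(s, " ")[[1]]))
    unlist(lapply(seq_along(words),
                  function(k) combn(words, k, paste, collapse = " ")))
}

res <- df %>%
    mutate(ngram = map(Column_A, all_subsets)) %>%
    unnest(ngram) %>%
    count(ngram, wt = Column_B)

res  # reproduces the 7-row table above: Word_A 100, Word_A Word_B 90, ...
```

This trades the tokenizer calls for an exponential subset enumeration, which is fine for short phrases but worth keeping in mind for rows with many words.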

Upvotes: 1
