Reputation: 31
I would like to perform the following calculations:
Input:

Column_A               Column_B
Word_A                 10
Word_A Word_B          20
Word_B Word_A          30
Word_A Word_B Word_C   40

Output:

Column_A1              Column_B1
Word_A                 100   = 10+20+30+40
Word_B                 90    = 20+30+40
Word_C                 40    = 40
Word_A Word_B          90    = 20+30+40
Word_A Word_C          40    = 40
Word_B Word_C          40    = 40
Word_A Word_B Word_C   40    = 40
The order of the words in the output does not matter, so Word_A Word_B = 90 = Word_B Word_A. Using the RWeka and tm libraries I was able to extract unigrams (single words), but I need n-grams for n = 1, 2, 3 and to calculate Column_B1.
Upvotes: 1
Views: 120
Reputation: 43354
A tidyverse approach:
library(tidyverse)
library(tokenizers)

# Example data from the question
df <- tibble(
  Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
  Column_B = c(10L, 20L, 30L, 40L)
)

df %>%
  rowwise() %>%
  mutate(ngram = list(c(tokenize_ngrams(Column_A, lowercase = FALSE, n = 3, n_min = 1),
                        tokenize_skip_ngrams(Column_A, lowercase = FALSE, n = 2),
                        recursive = TRUE)),
         ngram = list(unique(map_chr(strsplit(ngram, ' '),
                                     ~paste(sort(.x), collapse = ' '))))) %>%
  unnest(ngram) %>%
  count(ngram, wt = Column_B)
## # A tibble: 7 × 2
## ngram n
## <chr> <int>
## 1 Word_A 100
## 2 Word_A Word_B 90
## 3 Word_A Word_B Word_C 40
## 4 Word_A Word_C 40
## 5 Word_B 90
## 6 Word_B Word_C 40
## 7 Word_C 40
Note this is currently only robust for strings of up to three words. For longer strings you would have to decide how far you want the skip n-grams to go, or take a different approach altogether.
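One length-agnostic alternative (not part of the original answer) is to enumerate every subset of the distinct words in each row and add that row's value to each subset's total, since "n-grams in any order" here is equivalent to word subsets. A minimal sketch in Python:

```python
from itertools import combinations
from collections import Counter

# Example data from the question: (text, value) pairs
rows = [
    ("Word_A", 10),
    ("Word_A Word_B", 20),
    ("Word_B Word_A", 30),
    ("Word_A Word_B Word_C", 40),
]

totals = Counter()
for text, value in rows:
    # Sort the distinct words so "Word_B Word_A" and "Word_A Word_B"
    # map to the same keys
    words = sorted(set(text.split()))
    # Every non-empty subset of the row's words gets this row's value
    for n in range(1, len(words) + 1):
        for combo in combinations(words, n):
            totals[" ".join(combo)] += value

for ngram, total in sorted(totals.items()):
    print(ngram, total)
```

This reproduces the same seven rows as the tidyverse version (Word_A 100, Word_A Word_B 90, ..., Word_C 40) and works for rows of any length.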
Upvotes: 1