Reputation: 165
I am trying to find the combined (total) frequency of several words.
For example, I am using this code to find the frequencies of some words:
keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(keyterms2 = c("canadian", "american", "british")))
dfm <- dfm(toks) %>%
  dfm_group(groups = "Organization") %>%
  dfm_select(pattern = keyterms)
When I run the above using keyterms and the dictionary, I get the frequencies for each word individually.
A header | canadian | american | british |
---|---|---|---|
Organization | 10 | 10 | 10 |
Is there a way to write the script so that it returns the frequencies totalled up so that it looks like this:
A header | terms |
---|---|
Organization | 30 |
Thank you
Upvotes: 0
Views: 89
Reputation: 14902
The dictionary approach is the most elegant solution, since it maps all of your key terms onto a single dictionary key, so their counts are added together.
Here, I've illustrated how you can do this with the built-in inaugural corpus, where the grouping variable (analogous to your "Organization") is the president's name.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(terms = keyterms))
toks <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  tokens() %>%
  tokens_lookup(dictionary = dict)
dfm(toks) %>%
  dfm_group(groups = President) %>%
  convert(to = "data.frame")
## doc_id terms
## 1 Biden 9
## 2 Bush 6
## 3 Obama 8
## 4 Trump 11
(You can rename the first column to "A header" if you wish.)
Note that the usage of the groups argument changed in quanteda 3.0, so its value should no longer be quoted.
Upvotes: 1
Reputation: 388797
You can use rowSums():
result <- dfm(toks) %>%
  dfm_group(groups = "Organization") %>%
  dfm_select(pattern = keyterms) %>%
  rowSums()
Using stack(result)[2:1] would return a data frame.
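To see what rowSums() and stack() do here without needing a corpus, below is a minimal sketch on a plain base R matrix standing in for the dfm. The group names OrgA and OrgB and all the counts are made up for illustration; a real dfm would supply its own document names and features.

```r
# Toy counts standing in for a grouped dfm:
# rows are groups, columns are the selected key terms (values are made up).
counts <- matrix(
  c(10, 10, 10,
     2,  0,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OrgA", "OrgB"),
                  c("canadian", "american", "british"))
)

# Sum across the term columns: one named total per group.
totals <- rowSums(counts)

# stack() turns the named vector into a data frame with columns
# "values" and "ind"; [2:1] swaps them so the group name comes first.
result <- stack(totals)[2:1]
names(result) <- c("Organization", "terms")
print(result)
```

Renaming the columns at the end is optional; it just makes the result match the layout asked for in the question.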
Upvotes: 0