Find frequencies of multiple words combined as one?

Question

I am trying to find the combined frequency of several words.

For example, I am using this code to find the frequencies of some words:

keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(keyterms2 = c("canadian", "american", "british")))


dfm <- dfm(toks) %>%
  dfm_group(groups = "Organization") %>%
  dfm_select(pattern = keyterms)

When I run the above using keyterms and the dictionary, I get the frequencies for each word individually.

A header	canadian	american	british
Organization	10	10	10

Is there a way to write the script so that it returns the frequencies totalled up so that it looks like this:

A header	terms
Organization	30

Thank you

Ken Benoit · Accepted Answer

The dictionary approach is the most elegant solution, since it combines your keyword terms.

Here, I've illustrated how you can do this with the built-in inaugural corpus, where the grouping variable (analogous to your "Organization") is the president's name.

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(terms = keyterms))

toks <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  tokens() %>%
  tokens_lookup(dictionary = dict)

dfm(toks) %>%
  dfm_group(groups = President) %>%
  convert(to = "data.frame")
##   doc_id terms
## 1  Biden     9
## 2   Bush     6
## 3  Obama     8
## 4  Trump    11

(You can rename the first column to "A header" if you wish.)
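As a minimal sketch of that rename, the snippet below rebuilds the result table by hand (values copied from the output above; `res` is a hypothetical name) and relabels the first column with base R:

```r
# Rebuild the result shown above by hand; `res` is a hypothetical object name
res <- data.frame(
  doc_id = c("Biden", "Bush", "Obama", "Trump"),
  terms  = c(9, 6, 8, 11)
)

# Rename only the doc_id column; the name contains a space, so it has to be
# referred to with backticks later, e.g. res$`A header`
names(res)[names(res) == "doc_id"] <- "A header"

print(res)
```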

Note that the usage of the groups argument changed in quanteda 3.0: a document variable is now passed unquoted (groups = President rather than groups = "President").
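Applied to the setup in the question, the same approach might look like the sketch below. The texts, the `OrgA`/`OrgB` values, and the object names are made up for illustration; only the `Organization` document variable and the key terms come from the question. It assumes quanteda 3.0 or later and uses intermediate objects instead of a pipe.

```r
library("quanteda")

# Hypothetical toy documents standing in for the asker's data
txt <- c(
  d1 = "The Canadian and American delegates met.",
  d2 = "British and Canadian officials agreed.",
  d3 = "An American firm made the offer."
)
corp <- corpus(
  txt,
  docvars = data.frame(Organization = c("OrgA", "OrgA", "OrgB"))
)

# One dictionary key ("terms") that collects all three words
keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(terms = keyterms))

# Replace every matching token with the key, so the words count as one feature
toks <- tokens_lookup(tokens(corp), dictionary = dict)

# Group by the document variable, passed unquoted as of quanteda 3.0
grouped <- dfm_group(dfm(toks), groups = Organization)

res <- convert(grouped, to = "data.frame")
print(res)
```

With these toy texts, `res` has a single `terms` column holding 4 for OrgA and 1 for OrgB, which is the totalled layout the question asks for.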
