What I am looking to do is find the frequencies of multiple words/phrases and plot them in a graph by year.
I have been able to do this with single words like "american", but am having trouble with multi-word expressions such as "united states".
My df has a column for the actual text and then additional columns for metadata like author, year, and organization.
This is the code I used for single words like "american":
library("quanteda")
library("quanteda.textstats")
library("ggplot2")

# build the corpus from the data frame's "text" column
a_corpus <- corpus(df, text_field = "text")
freq_grouped_year <- textstat_frequency(dfm(tokens(a_corpus)),
                                        groups = a_corpus$Year)
# filter for the term "american" (dfm() lower-cases features by default)
freq_word_year <- subset(freq_grouped_year,
                         freq_grouped_year$feature %in% "american")
ggplot(freq_word_year, aes(x = group, y = frequency)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 30)) +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
When I try the same approach with a bigram like "united states", nothing shows up. From what I understand, dfm() builds its features from individual tokens, so word order is lost and matching bigrams or longer phrases doesn't work.
Is there a way to find frequencies for bigrams, trigrams, or longer phrases?
Thank you!
To identify compound tokens, or in quanteda terminology, phrases, you need to compound the tokens using your fixed list of compounds. (There are other ways, such as using textstat_collocations() with filtering, but since you have a fixed list here for selection, this is the simplest.)
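For completeness, a minimal sketch of that collocations route, which discovers frequent bigrams instead of matching a fixed list (the min_count threshold of 5 is just an illustrative value):
library("quanteda")
library("quanteda.textstats")
toks <- tokens(head(data_corpus_inaugural))
# score all adjacent two-word sequences; keep those occurring at least 5 times
colls <- textstat_collocations(toks, size = 2, min_count = 5)
head(colls[order(colls$count, decreasing = TRUE), c("collocation", "count")])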
library("quanteda")
## Package version: 3.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
a_corpus <- head(data_corpus_inaugural)
toks <- tokens(a_corpus)
# join each occurrence of "United States" into a single token
toks <- tokens_compound(toks, phrase("United States"), concatenator = " ")
# keep the original case so the compounded feature stays "United States"
freq_grouped_year <- textstat_frequency(dfm(toks, tolower = FALSE), groups = Year)
freq_word_year <- subset(freq_grouped_year,
                         freq_grouped_year$feature %in% "United States")
library("ggplot2")
ggplot(freq_word_year, aes(x = group, y = frequency)) +
  geom_point() +
  # scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 30)) +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
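The same compounding step generalizes to trigrams or longer fixed phrases: pass the whole set to phrase() at once. A quick sketch continuing from the objects above (the phrase list is illustrative, not from your data):
toks2 <- tokens_compound(tokens(a_corpus),
                         phrase(c("United States", "House of Representatives")),
                         concatenator = " ")
freq2 <- textstat_frequency(dfm(toks2, tolower = FALSE), groups = Year)
subset(freq2, freq2$feature %in% c("United States", "House of Representatives"))
Each compounded phrase then behaves like an ordinary feature, so the same subset-and-plot workflow applies unchanged.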