What I am looking to do is find the frequencies of multiple words/phrases and plot them in a graph by year.
I have been able to do this with single words like "american", but am having trouble with multi-word expressions such as "united states".
My df has a column for the actual text and then additional columns for metadata like author, year, and organization.
This is the code I used for single words like "american":
library("quanteda")
library("quanteda.textstats")
library("ggplot2")

# build the corpus from the data frame's "text" column
a_corpus <- corpus(df, text_field = "text")
freq_grouped_year <- textstat_frequency(dfm(tokens(a_corpus)),
                                        groups = a_corpus$Year)
# filter for the term "american" (dfm() lower-cases features by default)
freq_word_year <- subset(freq_grouped_year,
                         freq_grouped_year$feature %in% "american")
ggplot(freq_word_year, aes(x = group, y = frequency)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 30)) +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
When I try the same approach with a bigram like "united states", nothing shows up. From what I understand, dfm() builds its features from individual tokens, so word order is lost and matching bigrams or longer phrases doesn't work.
Is there a way to find frequencies for bigrams, trigrams, or longer phrases?
Thank you!
To identify compound tokens, or in quanteda terminology, phrases, you need to compound the tokens using your fixed list of compounds. (There are other ways, such as using textstat_collocations() with filtering, but since you have a fixed list here for selection, this is the simplest.)
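For completeness, a minimal sketch of that collocations route, which discovers frequent bigrams instead of matching a fixed list (the min_count threshold of 5 is just an illustrative value):
library("quanteda")
library("quanteda.textstats")
toks <- tokens(head(data_corpus_inaugural))
# score all adjacent two-word sequences; keep those occurring at least 5 times
colls <- textstat_collocations(toks, size = 2, min_count = 5)
head(colls[order(colls$count, decreasing = TRUE), c("collocation", "count")])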
library("quanteda")
## Package version: 3.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
a_corpus <- head(data_corpus_inaugural)
toks <- tokens(a_corpus)
# join each occurrence of "United States" into a single token
toks <- tokens_compound(toks, phrase("United States"), concatenator = " ")
# keep the original case so the compounded feature stays "United States"
freq_grouped_year <- textstat_frequency(dfm(toks, tolower = FALSE), groups = Year)
freq_word_year <- subset(freq_grouped_year,
                         freq_grouped_year$feature %in% "United States")
library("ggplot2")
ggplot(freq_word_year, aes(x = group, y = frequency)) +
  geom_point() +
  # scale_y_continuous(limits = c(0, 300), breaks = seq(0, 300, 30)) +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
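The same compounding step generalizes to trigrams or longer fixed phrases: pass the whole set to phrase() at once. A quick sketch continuing from the objects above (the phrase list is illustrative, not from your data):
toks2 <- tokens_compound(tokens(a_corpus),
                         phrase(c("United States", "House of Representatives")),
                         concatenator = " ")
freq2 <- textstat_frequency(dfm(toks2, tolower = FALSE), groups = Year)
subset(freq2, freq2$feature %in% c("United States", "House of Representatives"))
Each compounded phrase then behaves like an ordinary feature, so the same subset-and-plot workflow applies unchanged.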