Reputation: 789
I have been wondering whether it is possible to compute feature frequencies with the powerful quanteda library in R while also counting a list of phrases (multi-word "words"). For instance, I have the following data set:
library(quanteda)
library(quanteda.textstats)
df_sample<-c("Word Record",
"be able to count by word",
"But also include some phrases such as",
"World Record Super Bass Mr. President Mr. President")
When I calculate the textstat_frequency
of the df_sample I get something like this:
> tokens<-corpus(df_sample) %>% tokens(remove_punct = TRUE)
> dfm<-dfm(tokens)
>
> quanteda.textstats::textstat_frequency(dfm)
     feature frequency rank docfreq group
1       word         2    1       2   all
2     record         2    1       2   all
3         mr         2    1       1   all
4  president         2    1       1   all
5         be         1    5       1   all
6       able         1    5       1   all
7         to         1    5       1   all
8      count         1    5       1   all
9         by         1    5       1   all
10       but         1    5       1   all
11      also         1    5       1   all
12   include         1    5       1   all
13      some         1    5       1   all
14   phrases         1    5       1   all
15      such         1    5       1   all
16        as         1    5       1   all
17     world         1    5       1   all
18     super         1    5       1   all
19      bass         1    5       1   all
>
This is correct, but I also want to change my code so that the output counts and prints the phrases "Mr. President", "World Record", and "Super Bass":
key_lookups<-c("Mr. President", "World Record", "Super Bass" )
How can I use quanteda functions so that my output includes, along with the previous counts, the frequency of these phrases? For example:
"Mr. President" 2 "World Record" 2 "Super Bass" 1
Upvotes: 0
Views: 223
Reputation: 789
In the quanteda library, you can take advantage of the tokens_compound function:
library(quanteda)
library(quanteda.textstats)
df_sample<-c("World Record",
"be able to count by word",
"But also include some phrases such as",
"World Record Super Bass Mr. President Mr. President")
toks <- tokens(df_sample,remove_punct = TRUE)
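Now let's compound the key_lookups phrases in the toks object. The period is dropped from "Mr President" because punctuation was removed during tokenization: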
key_lookups<-c("Mr President", "World Record", "Super Bass" )
toks_comp <- tokens_compound(toks, pattern = phrase(key_lookups))
Take a look at the output:
> toks_comp %>% dfm() %>% textstat_frequency()
        feature frequency rank docfreq group
1  world_record         2    1       2   all
2  mr_president         2    1       1   all
3            be         1    3       1   all
4          able         1    3       1   all
5            to         1    3       1   all
6         count         1    3       1   all
7            by         1    3       1   all
8          word         1    3       1   all
9           but         1    3       1   all
10         also         1    3       1   all
11      include         1    3       1   all
12         some         1    3       1   all
13      phrases         1    3       1   all
14         such         1    3       1   all
15           as         1    3       1   all
16   super_bass         1    3       1   all
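If only the phrase counts are of interest, the frequency table can be filtered afterwards. The following is a minimal sketch rather than the one correct way to do it: it assumes the same sample data as above, passes concatenator = " " so the compounded features read like the original phrases, and uses phrase_freq as a made-up name for the filtered table.

```r
library(quanteda)
library(quanteda.textstats)

df_sample <- c("World Record",
               "be able to count by word",
               "But also include some phrases such as",
               "World Record Super Bass Mr. President Mr. President")
key_lookups <- c("Mr President", "World Record", "Super Bass")

toks <- tokens(df_sample, remove_punct = TRUE)

# Join each matched phrase into a single token, separated by a space
# instead of the default "_"
toks_comp <- tokens_compound(toks, pattern = phrase(key_lookups),
                             concatenator = " ")

freq <- textstat_frequency(dfm(toks_comp))

# dfm() lowercases features by default, so compare against lowercased lookups
# (phrase_freq is a hypothetical name used only in this sketch)
phrase_freq <- freq[freq$feature %in% tolower(key_lookups), ]
phrase_freq
```

Keep in mind that compounding replaces the component tokens: as the output above shows, "world" and "record" no longer get their own rows once they are part of a matched phrase. If you need both the single-word and the phrase counts, a different approach is required.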
Upvotes: 2
Reputation: 23608
First, a warning about your example code: do not create objects that have the same name as functions (like tokens and dfm). This will (eventually) lead to errors that are difficult to debug.
There are probably a few ways of doing this. I created a "normal" tokens object and an ngrams tokens object and turned both into dfms. From the ngrams dfm, I kept only the phrases you wanted. I then combined the dfms, after which you can use textstat_frequency as normal.
Note: you can't combine tokens objects like you can combine dfm objects.
library(quanteda)
library(quanteda.textstats)
df_sample<-c("Word Record",
"be able to count by word",
"But also include some phrases such as",
"World Record Super Bass Mr. President Mr. President")
my_tokens <- corpus(df_sample) %>% tokens(remove_punct = TRUE)
my_dfm <- dfm(my_tokens)
# No periods in the lookups, as punctuation was removed during tokenization
key_lookups<-c("Mr President", "World Record", "Super Bass" )
my_tokens_ngram <- tokens_ngrams(my_tokens, n = 2, concatenator = " ")
my_dfm_ngrams <- dfm(my_tokens_ngram)
# Only keep the lookups
my_dfm_ngrams <- dfm_keep(my_dfm_ngrams, key_lookups)
# Combine both dfms
my_dfms <- rbind(my_dfm, my_dfm_ngrams)
# if necessary uncomment next part
# my_dfms <- dfm_compress(my_dfms)
outcome:
head(textstat_frequency(my_dfms), 5)
       feature frequency rank docfreq group
1         word         2    1       2   all
2       record         2    1       2   all
3           mr         2    1       1   all
4    president         2    1       1   all
5 mr president         2    1       1   all
tail(textstat_frequency(my_dfms), 5)
        feature frequency rank docfreq group
18        world         1    6       1   all
19        super         1    6       1   all
20         bass         1    6       1   all
21 world record         1    6       1   all
22   super bass         1    6       1   all
Note that using rbind on dfms creates new document names like "text1.1". If you want these merged back into the original documents, call dfm_compress(my_dfms) first and then call textstat_frequency.
Upvotes: 0