R_Student

Reputation: 789

Quanteda: calculating token frequencies in a dfm, including a customized list of phrases

I have been wondering whether it is possible to compute feature frequencies with the powerful quanteda library in R while also accounting for a custom list of phrases or "words". For instance, I have the following data set:

library(quanteda)
library(quanteda.textstats)

df_sample<-c("Word Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")

When I calculate the textstat_frequency of df_sample, I get something like this:

> tokens<-corpus(df_sample) %>% tokens(remove_punct = TRUE)
> dfm<-dfm(tokens)
> 
> quanteda.textstats::textstat_frequency(dfm)
     feature frequency rank docfreq group
1       word         2    1       2   all
2     record         2    1       2   all
3         mr         2    1       1   all
4  president         2    1       1   all
5         be         1    5       1   all
6       able         1    5       1   all
7         to         1    5       1   all
8      count         1    5       1   all
9         by         1    5       1   all
10       but         1    5       1   all
11      also         1    5       1   all
12   include         1    5       1   all
13      some         1    5       1   all
14   phrases         1    5       1   all
15      such         1    5       1   all
16        as         1    5       1   all
17     world         1    5       1   all
18     super         1    5       1   all
19      bass         1    5       1   all
> 

which is correct, but I also want to change my code in order to take into account, and print in the output, the words or phrases "Mr. President", "World Record", and "Super Bass":

key_lookups<-c("Mr. President", "World Record", "Super Bass" )

How can I use quanteda functions to get, along with the previous counts, the frequencies of those phrases in my output? For example:

"Mr. President" 2 "World Record" 2 "Super Bass" 1

Upvotes: 0

Views: 223

Answers (2)

R_Student

Reputation: 789

In the quanteda library one can take advantage of the function tokens_compound:

library(quanteda)
library(quanteda.textstats)

df_sample<-c("World Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")

toks <- tokens(df_sample, remove_punct = TRUE)

Now let's compound the key_lookups over the toks object:

# no periods here: punctuation was already removed by remove_punct = TRUE
key_lookups <- c("Mr President", "World Record", "Super Bass")
toks_comp <- tokens_compound(toks, pattern = phrase(key_lookups))

Take a look at the output:

> toks_comp %>% dfm() %>% textstat_frequency()
        feature frequency rank docfreq group
1  world_record         2    1       2   all
2  mr_president         2    1       1   all
3            be         1    3       1   all
4          able         1    3       1   all
5            to         1    3       1   all
6         count         1    3       1   all
7            by         1    3       1   all
8          word         1    3       1   all
9           but         1    3       1   all
10         also         1    3       1   all
11      include         1    3       1   all
12         some         1    3       1   all
13      phrases         1    3       1   all
14         such         1    3       1   all
15           as         1    3       1   all
16   super_bass         1    3       1   all

Upvotes: 2

phiver

Reputation: 23608

First, a warning about your example code: do not create objects that have the same name as functions (like tokens and dfm); this will eventually lead to errors and is difficult to debug.

There are probably a few ways of doing this. I created a "normal" tokens object and an n-grams tokens object, turned both into dfms, and from the n-grams dfm kept only the phrases you wanted. Then I combined the dfms, and you can use textstat_frequency as normal.

Note: you can't combine tokens objects the way you can combine dfm objects.

library(quanteda)
library(quanteda.textstats)

df_sample<-c("Word Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")



my_tokens <- corpus(df_sample) %>% tokens(remove_punct = TRUE)
my_dfm <- dfm(my_tokens)

# No periods, as punctuation was removed when creating the tokens
key_lookups <- c("Mr President", "World Record", "Super Bass")


my_tokens_ngram <- tokens_ngrams(my_tokens, n = 2, concatenator = " ")

my_dfm_ngrams <- dfm(my_tokens_ngram)

# Only keep the lookups
my_dfm_ngrams <- dfm_keep(my_dfm_ngrams, key_lookups)

# Combine both dfms
my_dfms <- rbind(my_dfm, my_dfm_ngrams)

# if necessary uncomment next part
# my_dfms <- dfm_compress(my_dfms) 

Outcome:

head(textstat_frequency(my_dfms), 5)
       feature frequency rank docfreq group
1         word         2    1       2   all
2       record         2    1       2   all
3           mr         2    1       1   all
4    president         2    1       1   all
5 mr president         2    1       1   all

tail(textstat_frequency(my_dfms), 5)
        feature frequency rank docfreq group
18        world         1    6       1   all
19        super         1    6       1   all
20         bass         1    6       1   all
21 world record         1    6       1   all
22   super bass         1    6       1   all

Do note that using rbind on dfms creates new document names like "text1.1". If you want these merged back into the original documents, call dfm_compress(my_dfms) first and then call textstat_frequency.
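As a minimal sketch, the whole approach above could be wrapped in a small helper. The function name count_with_phrases is hypothetical (not part of quanteda); it assumes the same df_sample data and two-word phrases as in the example:

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical helper: counts single tokens plus a set of multi-word
# phrases in a single textstat_frequency() table.
count_with_phrases <- function(x, phrases, n = 2) {
  toks <- tokens(x, remove_punct = TRUE)
  d1 <- dfm(toks)
  # Build n-grams, keep only the requested phrases
  d2 <- dfm_keep(dfm(tokens_ngrams(toks, n = n, concatenator = " ")), phrases)
  # rbind the two dfms, merge duplicate document names, then tabulate
  textstat_frequency(dfm_compress(rbind(d1, d2)))
}

count_with_phrases(df_sample, c("Mr President", "World Record", "Super Bass"))
```

This is just the answer's pipeline in one place; for phrases longer than two words you would need to adjust the n argument accordingly.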

Upvotes: 0
