Fatih Bozdağ

Reputation: 37

How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.

For instance, I create a tokens object from a list of documents and apply collocation analysis:

library("quanteda")
library("quanteda.textstats")
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)

However, as far as I can see, there is no clear way to see which collocations are frequent in (or even present in) which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only identifies documents as text1, text2, etc.

I would like to group the collocation analysis results based on docvars. Is this possible with quanteda?

Upvotes: 1

Views: 605

Answers (1)

Ken Benoit

Reputation: 14902

It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.

So the solution for grouping by document (or any other variable) is to:

  1. Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
  2. Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
  3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document.

Implementation using the built-in inaugural corpus:

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)

colls <- textstat_collocations(toks)
head(colls)
##        collocation count count_nested length   lambda        z
## 1           let us    34            0      2 6.257000 17.80637
## 2  fellow citizens    14            0      2 6.451738 16.18314
## 3 fellow americans    15            0      2 6.221678 16.16410
## 4      one another    14            0      2 6.592755 14.56082
## 5        god bless    15            0      2 8.628894 13.57027
## 6    united states    12            0      2 9.192044 13.22077

Now we compound them and keep only the collocations, then get the frequencies by document:

dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")

That dfm already contains the per-document counts of each collocation, but if you want them in a data.frame format, with a grouping option, use textstat_frequency(). Here I've output only the top two per document; if you remove n = 2, it will return the frequencies of all collocations by document.

textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
##             feature frequency rank docfreq        group
## 1   nuclear weapons         4    1       1  1985-Reagan
## 2     human freedom         3    2       1  1985-Reagan
## 3        new breeze         4    1       1    1989-Bush
## 4    new engagement         3    2       1    1989-Bush
## 5            let us         7    1       1 1993-Clinton
## 6  fellow americans         4    2       1 1993-Clinton
## 7            let us         6    1       1 1997-Clinton
## 8       new century         6    1       1 1997-Clinton
## 9  nation's promise         2    1       1    2001-Bush
## 10      common good         2    1       1    2001-Bush
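
Finally, to come back to the question about grouping on docvars rather than document names: groups can be any document-level variable, since docvars carry through from the corpus to the tokens and dfm objects. Below is a sketch using the Party docvar of the inaugural corpus (output not shown):

# group collocation frequencies by a document variable instead of docnames
textstat_frequency(dfmat, groups = docvars(dfmat, "Party"), n = 2)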

Upvotes: 1
