Guilherme Parreira

Reputation: 1011

How to filter a dfm by documents with at least n terms in quanteda?

I am analyzing text data from a round table, and I would like to know whether it is possible to keep only those documents that have at least "n" terms.

My corpus has documents that contain only one word, such as "Thanks", "Sometimes", "Really", "go". I would like to remove them in order to decrease sparsity.

I tried dfm_trim from quanteda, but I couldn't get it to do what I want:

corpus_post80inaug <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm <- dfm(corpus_post80inaug)
ntoken(dfm)
1981-Reagan  1985-Reagan    1989-Bush 1993-Clinton 1997-Clinton 
       2790         2921         2681         1833         2449 
  2001-Bush    2005-Bush   2009-Obama   2013-Obama   2017-Trump 
       1808         2319         2711         2317         1660 
dfm <- dfm_trim(dfm, min_docfreq = 2000)
ntoken(dfm)
1981-Reagan  1985-Reagan    1989-Bush 1993-Clinton 1997-Clinton 
          0            0            0            0            0 
  2001-Bush    2005-Bush   2009-Obama   2013-Obama   2017-Trump 
          0            0            0            0            0 

I expected that only 1993-Clinton, 2001-Bush and 2017-Trump would have 0 tokens, or would be dropped from the dfm. Note: this example is for illustration only; it is not the text data I am analyzing.

Upvotes: 1

Views: 956

Answers (1)

phiver

Reputation: 23598

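You should use dfm_subset, not dfm_trim. dfm_trim removes features (terms) based on their frequency across all documents, not documents based on their length. Its min_docfreq and max_docfreq arguments set the minimum and maximum number of documents a term must appear in. With only 10 documents, no term can appear in 2000 of them, so min_docfreq = 2000 drops every feature and every document ends up with 0 tokens. To remove documents, use dfm_subset.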

corpus_post80inaug <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm <- dfm(corpus_post80inaug)

# keep only documents with at least 2000 tokens
my_dfm <- dfm_subset(dfm, ntoken(dfm) >= 2000)

ntoken(my_dfm)
 1981-Reagan  1985-Reagan    1989-Bush 1997-Clinton    2005-Bush   2009-Obama   2013-Obama 
        2790         2921         2681         2449         2319         2711         2317 
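
The same pattern applies to the one-word documents described in the question. Below is a minimal sketch on made-up texts: the object names (txt, toks, dfmat, dfmat_long), the example sentences and the threshold n = 2 are all assumptions for illustration. It builds the dfm from a tokens object, which recent quanteda versions expect instead of calling dfm() directly on a corpus.

```r
library(quanteda)

# Hypothetical toy texts standing in for the round-table data
txt <- c(d1 = "Thanks",
         d2 = "I really think we should go ahead",
         d3 = "Sometimes",
         d4 = "That is a fair point")

toks  <- tokens(txt, remove_punct = TRUE)
dfmat <- dfm(toks)

# dfm_trim() filters features, so the number of documents is unchanged
ndoc(dfm_trim(dfmat, min_docfreq = 2))

# dfm_subset() filters documents: keep those with at least n tokens
n <- 2
dfmat_long <- dfm_subset(dfmat, ntoken(dfmat) >= n)
docnames(dfmat_long)
```

Here d1 and d3 contain a single token each, so only d2 and d4 should remain in dfmat_long.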

Upvotes: 2
