Reputation: 1011
I am analyzing text data from a round table, and I would like to know if it is possible to keep only those documents which have more than "n" terms.
My corpus has documents which contain only one word, such as "Thanks", "Sometimes", "Really", "go". I would like to remove them in order to decrease sparsity.
I tried dfm_trim from quanteda but I couldn't get it to work:
corpus_post80inaug <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm <- dfm(corpus_post80inaug)
ntoken(dfm)
1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
2790 2921 2681 1833 2449
2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
1808 2319 2711 2317 1660
dfm <- dfm_trim(dfm, min_docfreq = 2000)
ntoken(dfm)
1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
0 0 0 0 0
2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
0 0 0 0 0
I would expect that only 1993-Clinton, 2001-Bush and 2017-Trump would have 0, or that they would be removed from the dfm.
Note: this example is for illustration purposes only; it is not the text data I am analyzing.
Upvotes: 1
Views: 956
Reputation: 23598
You should use dfm_subset, not dfm_trim. dfm_trim calculates frequencies across all documents, not per document, although you can specify the minimum (or maximum) number of documents a term should appear in. For removing documents, we use dfm_subset.
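That is also why your call returned all zeros: min_docfreq = 2000 requires a term to appear in at least 2000 documents, which is impossible in a 10-document corpus, so every feature was trimmed. A small sketch of the intended use (the threshold of 2 here is just an illustrative value):
# keep only features that occur in at least 2 of the 10 documents
dfm_trimmed <- dfm_trim(dfm, min_docfreq = 2)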
corpus_post80inaug <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm <- dfm(corpus_post80inaug)
# remove documents with fewer than 2000 tokens
my_dfm <- dfm_subset(dfm, ntoken(dfm) >= 2000)
ntoken(my_dfm)
1981-Reagan 1985-Reagan 1989-Bush 1997-Clinton 2005-Bush 2009-Obama 2013-Obama
2790 2921 2681 2449 2319 2711 2317
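Since the original goal was also to decrease sparsity, a possible follow-up (a sketch, not part of the original answer) is to trim rare features after subsetting the documents:
# first drop the short documents, then drop features that
# appear in only a single one of the remaining documents
my_dfm <- dfm_subset(dfm, ntoken(dfm) >= 2000)
my_dfm <- dfm_trim(my_dfm, min_docfreq = 2)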
Upvotes: 2