Reputation: 623
I have a term-document matrix. I want to subset it and keep only the terms that appear more than a certain number of times, i.e. the rows whose sum is greater than a specific number. Is there a quick way to achieve this? By the way, the matrix is huge.
Upvotes: 1
Views: 1945
Reputation: 14912
In the quanteda package:
require(quanteda)
myDfm <- dfm(data_char_ukimmig2010, remove_punct = TRUE)
myDfm
## Document-feature matrix of: 9 documents, 1,644 features (81.9% sparse).
# remove infrequent terms
dfm_trim(myDfm, min_count = 10, verbose = TRUE)
## Removing features occurring:
## - fewer than 10 times: 1,567
## Total features removed: 1,567 (95.3%).
## Document-feature matrix of: 9 documents, 77 features (32.5% sparse).
Other options exist for removing features based on document frequency and on "sparsity" (a relative measure), as defined in the tm package.
Upvotes: 1
Reputation: 1117
Yes. If you are using the tm package, there is a findFreqTerms function in which you can specify the lowfreq (the lower frequency bound) you want:
tdm # your term document matrix
your_terms <- findFreqTerms(tdm, lowfreq = [...])
If you want to reduce the tdm to only those frequent terms, subset its rows by the returned term names:
tdm[your_terms, ]
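For reference, the same row-sum filter can be sketched in base R on a small dense matrix. This is only an illustration: the toy data and the names min_total, term_totals and tdm_frequent are made up for the example and are not part of tm. It uses a strict "greater than" comparison, as asked in the question; as far as I know findFreqTerms treats lowfreq as an inclusive bound, so check that if the boundary matters to you.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# (Made-up data, purely for illustration.)
tdm <- matrix(
  c(5, 0, 7,
    1, 0, 0,
    2, 3, 4,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("tax", "rare", "vote", "once"),
                  Docs  = c("d1", "d2", "d3"))
)

min_total <- 5  # hypothetical threshold

# Total number of occurrences of each term across all documents.
term_totals <- rowSums(tdm)

# Keep only rows whose total is strictly greater than the threshold.
# drop = FALSE keeps the result a matrix even if a single row survives.
tdm_frequent <- tdm[term_totals > min_total, , drop = FALSE]

print(rownames(tdm_frequent))
```

For a huge tm TermDocumentMatrix you would probably not want to convert to a dense matrix first; slam::row_sums(tdm) should give the same totals directly on the sparse representation, and the logical subsetting then works the same way.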
Hope this helps.
Upvotes: 1