NinjaR

Reputation: 623

How to filter term document matrix based on frequency of occurrence of each term

I have a term document matrix. I want to subset it and keep only those terms that appear more than a certain number of times, i.e. the row sum should be greater than a specific number. Is there a quick way to achieve this? By the way, the matrix is huge.

Upvotes: 1

Views: 1945

Answers (2)

Ken Benoit

Reputation: 14912

In the quanteda package:

library(quanteda)

myDfm <- dfm(data_char_ukimmig2010, remove_punct = TRUE)
myDfm
## Document-feature matrix of: 9 documents, 1,644 features (81.9% sparse).

# remove infrequent terms
# (newer quanteda releases name this argument min_termfreq instead of min_count)
dfm_trim(myDfm, min_count = 10, verbose = TRUE)
## Removing features occurring: 
##   - fewer than 10 times: 1,567
##   Total features removed: 1,567 (95.3%).
## Document-feature matrix of: 9 documents, 77 features (32.5% sparse).

Other options exist for removing features based on document frequency, and "sparsity" (a relative measure) as defined in the tm package.
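As a rough sketch of the document-frequency option mentioned above, the same dfm can be trimmed with the min_docfreq argument of dfm_trim. This assumes a reasonably recent quanteda release, where dfm() is built from a tokens object; the threshold of 3 and the name dfm_by_docfreq are arbitrary choices for illustration, not part of the original answer.

```r
library(quanteda)

# Tokenise first; recent quanteda releases expect dfm() to receive tokens
toks <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
myDfm <- dfm(toks)

# Keep only features that occur in at least 3 documents
# (the threshold 3 is an arbitrary illustration)
dfm_by_docfreq <- dfm_trim(myDfm, min_docfreq = 3)

# Inspect how many features survive the trim
nfeat(myDfm)
nfeat(dfm_by_docfreq)
```

Because the trim works on the sparse dfm directly, this should stay practical for large matrices.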

Upvotes: 1

Codutie

Reputation: 1117

Yes. If you are using the tm package, the findFreqTerms function returns the terms whose total frequency is at least the lowfreq value you specify:

tdm # your term document matrix
your_terms <- findFreqTerms(tdm, lowfreq = [...]) 

To reduce the tdm to only those frequent terms, subset it by term name:

tdm[your_terms, ] 
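Note that findFreqTerms treats lowfreq as an inclusive bound, while the question asks for row sums strictly greater than a number. A minimal sketch of the strict version, computing row sums on the sparse matrix with slam::row_sums (slam is installed alongside tm). The toy corpus, the threshold, and the names term_totals, keep and tdm_small are made up for illustration:

```r
library(tm)

# Toy corpus (made-up text, for illustration only)
docs <- c("apple banana apple", "banana cherry apple", "apple date")
corpus <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus)

# Total count per term, computed on the sparse representation
# so the matrix never has to be converted with as.matrix()
term_totals <- slam::row_sums(tdm)

# Keep terms whose total count is strictly greater than the threshold
threshold <- 1
keep <- names(term_totals)[term_totals > threshold]
tdm_small <- tdm[keep, ]

Terms(tdm_small)
```

Avoiding as.matrix() matters for a huge matrix, since a dense copy may not fit in memory.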

hope this helps

Upvotes: 1
