CafeRacer

Reputation: 63

R - Setting a Frequency to a Document Term Matrix

I am looking for help with my R code for exporting a DocumentTermMatrix. The full matrix is too large to export, so I was wondering whether there is a way to apply a minimum frequency to the DTM. For example, keep only the terms that occur 5 or more times.

dtm <- DocumentTermMatrix(alltextclean)

write.csv(as.matrix(dtm), "dtm.csv")

The above produces a file that is too large. Can I apply a frequency threshold here? I also tried the code below, but it leaves me with a list of terms without their counts (the counts would also be useful).

termsonly <- findFreqTerms(dtm, 5)

write.csv(termsonly, "termsonly.csv")

Adding the term counts to the output above would also be helpful.

Thanks for the help!

Upvotes: 1

Views: 377

Answers (1)

StupidWolf

Reputation: 46938

I guess you are looking for the total number of occurrences of each term across all documents. Using an example dataset:

library(tm)
data(crude)

If your matrix is not too large, you can convert it to a dense matrix and sum the columns:

dtm = DocumentTermMatrix(crude)
# Total count of each term across all documents
Freq = colSums(as.matrix(dtm))

Otherwise, let's say we keep only the terms with at least 5 occurrences:

termsonly <- findFreqTerms(dtm, 5)
Freq = colSums(as.matrix(dtm[,termsonly]))

Or you can convert it to a sparseMatrix and sum the columns, which avoids building the dense matrix:

library(Matrix)
Freq = colSums(sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                            dims = dim(dtm), dimnames = dtm$dimnames))
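As a possible alternative to the conversion above, the slam package (which tm depends on) can sum the columns of the DTM directly, because a DocumentTermMatrix is stored as a slam simple triplet matrix. This is only a sketch on the `crude` example data; `Freq5` is a name introduced here for illustration.

```r
library(tm)
library(slam)  # installed as a dependency of tm
data(crude)

dtm <- DocumentTermMatrix(crude)

# col_sums() works on the sparse triplet representation,
# so no dense matrix is built
Freq <- col_sums(dtm)

# Keep only the terms with at least 5 occurrences in total
Freq5 <- Freq[Freq >= 5]
head(sort(Freq5, decreasing = TRUE))
```

The names of `Freq5` should match what `findFreqTerms(dtm, 5)` returns, since both are based on the total count of each term.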

You can also check this post if you like a tidy solution.
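To get a CSV with both the terms and their counts, as asked in the question, something along these lines should work. It is a minimal sketch on the `crude` example data: the file name `termsonly.csv` is taken from the question, and `term_counts` is a name introduced here for illustration.

```r
library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude)

# Terms that occur at least 5 times across all documents
termsonly <- findFreqTerms(dtm, lowfreq = 5)

# Total count per term, computed on the reduced matrix only
Freq <- colSums(as.matrix(dtm[, termsonly]))

# One row per term with its count, most frequent first
term_counts <- data.frame(term = names(Freq), count = as.vector(Freq))
term_counts <- term_counts[order(-term_counts$count), ]

write.csv(term_counts, "termsonly.csv", row.names = FALSE)
```

If the reduced matrix itself is needed rather than the totals, `write.csv(as.matrix(dtm[, termsonly]), "dtm.csv")` exports only the frequent terms, which should make the file considerably smaller.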

Upvotes: 1
