Reputation: 63
I am looking for help exporting a DocumentTermMatrix from R. The exported file is too large, so I was wondering whether there is a way to apply a minimum frequency to the DTM. For example, only keep terms in the DTM that have been used 5 or more times.
dtm <- DocumentTermMatrix(alltextclean)
write.csv(as.matrix(dtm), "dtm.csv")
The above produces a file that is too large. Can I add a frequency threshold to this? I also tried the code below, but it leaves me with a list of terms without their counts (a term count would also be useful).
termsonly <- findFreqTerms(dtm, 5)
write.csv(termsonly, "termsonly.csv")
Adding a frequency to the above would also be helpful.
Thanks for the help!
Upvotes: 1
Views: 377
Reputation: 46938
I guess you are looking for the total number of occurrences of each term across all documents. Using an example dataset:
library(tm)
data(crude)
If your matrix is small enough to convert to a dense matrix, you can do:
dtm = DocumentTermMatrix(crude)
Freq = colSums(as.matrix(dtm))
Otherwise, let's say we keep only the terms with at least 5 occurrences before converting:
termsonly <- findFreqTerms(dtm, 5)
Freq = colSums(as.matrix(dtm[,termsonly]))
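To tie this back to the export in the question, here is a minimal sketch that writes both the reduced matrix and a term/count table to CSV. It uses the `crude` example data; the file names are the ones from the question, and `dtm_small` and `freq_df` are illustrative names, not anything from tm.

```r
library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude)

# Terms that occur at least 5 times across all documents
termsonly <- findFreqTerms(dtm, 5)

# Subset the DTM first, so only the reduced matrix is made dense
dtm_small <- dtm[, termsonly]
Freq <- colSums(as.matrix(dtm_small))

# Term/count table, most frequent terms first (freq_df is an illustrative name)
freq_df <- data.frame(term = names(Freq), count = as.vector(Freq))
freq_df <- freq_df[order(-freq_df$count), ]

# Export the reduced matrix and the term counts
write.csv(as.matrix(dtm_small), "dtm.csv")
write.csv(freq_df, "termsonly.csv", row.names = FALSE)
```

If the reduced file is still too large, raising the threshold passed to `findFreqTerms` shrinks it further.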
Alternatively, cast the full DTM into a sparseMatrix from the Matrix package and sum the columns without ever making it dense:
library(Matrix)
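# dims keeps the full shape even if the last documents contain no terms
Freq = colSums(sparseMatrix(i=dtm$i, j=dtm$j, x=dtm$v, dims=c(dtm$nrow, dtm$ncol), dimnames=dtm$dimnames))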
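Another option that avoids any conversion: tm stores a DocumentTermMatrix as a `simple_triplet_matrix` from the slam package (which tm depends on), so `slam::col_sums` can sum the columns directly. A minimal sketch on the `crude` data, where `Freq5` is just an illustrative name:

```r
library(tm)
library(slam)
data(crude)

dtm <- DocumentTermMatrix(crude)

# Column sums computed on the sparse triplet representation
Freq <- col_sums(dtm)

# Keep only terms with at least 5 occurrences in total
Freq5 <- Freq[Freq >= 5]

# Most frequent terms first
head(sort(Freq5, decreasing = TRUE))
```

The result is a named numeric vector, so it can be turned into a data frame and written with `write.csv` the same way as any other `Freq` above.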
You can also check this post if you like a tidy solution.
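For reference, one possible tidy approach (an assumption on my part, not necessarily what the linked post shows) uses `tidy()` from tidytext, which turns a DTM into one row per document/term pair with a `count` column. This sketch assumes tidytext and dplyr are installed; `term_counts` is an illustrative name.

```r
library(tm)
library(dplyr)
library(tidytext)
data(crude)

dtm <- DocumentTermMatrix(crude)

# One row per (document, term) pair with a non-zero count
term_counts <- tidy(dtm) %>%
  group_by(term) %>%
  summarise(count = sum(count)) %>%   # total occurrences per term
  filter(count >= 5) %>%              # keep terms used 5 or more times
  arrange(desc(count))

write.csv(term_counts, "termsonly.csv", row.names = FALSE)
```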
Upvotes: 1