Efficiently combine many Document Term Matrices

Question

If I have a list of many document-term matrices (DTMs), I can combine them like this:

# setup for example
library(tm)
data("acq")
data("crude")
acq_dtm <- DocumentTermMatrix(acq)
crude_dtm <- DocumentTermMatrix(crude)
# make list of dtms
list_of_dtms <- list(acq_dtm, crude_dtm)
# convert list of dtms into one big dtm
dtms_combined_into_one <- do.call(tm:::c.DocumentTermMatrix, list_of_dtms)

But this seems very slow and memory-intensive, and it becomes a major bottleneck when dealing with a few thousand DTMs. How can I combine them faster and with less memory?

Since a DTM is a sparse matrix, I wonder if anyone knows of a method for combining sparse matrices that might be useful here. In my actual use case I am not starting with a corpus but with lists of word counts.
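To make the word-count case concrete, here is a minimal sketch of the kind of thing I have in mind: building one sparse matrix directly from the counts instead of combining many small DTMs. It uses the Matrix package rather than tm, and the `word_counts` list and its contents are made up for illustration. I have not checked how it behaves at the scale of a few thousand documents.

```r
library(Matrix)

# Hypothetical input: one named integer vector of word counts per document.
word_counts <- list(
  doc1 = c(oil = 3L, price = 1L),
  doc2 = c(merger = 2L, price = 4L, shares = 1L),
  doc3 = c(oil = 1L)
)

# All terms in document order, flattened into one character vector.
all_terms <- unlist(lapply(word_counts, names), use.names = FALSE)

# Build the shared vocabulary once, across all documents.
vocab <- sort(unique(all_terms))

# Row index: document number, repeated once per term in that document.
i <- rep(seq_along(word_counts), lengths(word_counts))
# Column index: position of each term in the shared vocabulary.
j <- match(all_terms, vocab)
# Values: the counts themselves.
v <- unlist(word_counts, use.names = FALSE)

# One documents-by-terms sparse matrix, built in a single call.
combined <- sparseMatrix(
  i = i, j = j, x = v,
  dims = c(length(word_counts), length(vocab)),
  dimnames = list(names(word_counts), vocab)
)
```

The result is a Matrix sparse matrix, not a tm DocumentTermMatrix, so it would still need converting if tm functions are required afterwards.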

Here's an R-Fiddle, in case that's useful for quick testing: http://www.r-fiddle.org/#/fiddle?id=SojC9ZlA (R-Fiddle seems promising, but I haven't found it very reliable. Is there anything good for this kind of quick prototyping that can install packages?)
