Reputation: 42303
If I have a list of many document-term matrices, I can combine them like this:
# setup for example
require(tm)
data("acq")
data("crude")
acq_dtm <- DocumentTermMatrix(acq)
crude_dtm <- DocumentTermMatrix(crude)
# make list of dtms
list_of_dtms <- list(acq_dtm, crude_dtm)
# convert list of dtms into one big dtm
dtms_combined_into_one <- do.call(tm:::c.DocumentTermMatrix, list_of_dtms)
But this seems very slow and memory-intensive, and it is a major bottleneck when dealing with a few thousand dtms. How can I combine them faster and with less memory?
Since the dtm is a sparse matrix, I wonder if anyone knows of a method for combining sparse matrices that might be useful here. In my actual use-case I am not starting with a corpus but with lists of word counts.
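For example, the raw data looks roughly like this (the document names and counts here are made up, just to show the shape):

# hypothetical shape of the raw data: one named count vector per document
word_counts <- list(
  doc1 = c(oil = 3, opec = 1, prices = 2),
  doc2 = c(shares = 4, oil = 1)
)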
Here's an R-Fiddle, in case that's useful for quick testing: http://www.r-fiddle.org/#/fiddle?id=SojC9ZlA (it seems promising, but I haven't found it very reliable; is there anything good for this kind of quick prototyping that can install packages?)
Upvotes: 4
Views: 3299
Reputation: 6365
I do not think there is a trivial way to speed up what you are already doing (maybe there is a clever way). Take a look at str(acq_dtm):
List of 6
$ i : int [1:4135] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:4135] 20 33 60 135 187 206 238 256 268 286 ...
$ v : num [1:4135] 1 1 2 1 1 2 2 6 1 1 ...
$ nrow : int 50
$ ncol : int 2103
$ dimnames:List of 2
..$ Docs : chr [1:50] "10" "12" "44" "45" ...
..$ Terms: chr [1:2103] "0.5165" "0.523" "0.8" "100" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
i points to a document number in the Docs component, and j points to a term (the first few terms are numbers). v is the frequency of term j in document i. When you do c(acq_dtm, crude_dtm), it's more than just stacking up some sparse matrices (that could be done with slam::abind_simple_sparse_array); the Terms of the two matrices have to be unioned, and then the appropriate i and j values have to be recomputed.
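If that union-and-recompute step is the bottleneck when done pairwise a few thousand times, one idea is to do it once over the whole list and build the combined triplet matrix directly. Here is a rough sketch (combine_dtms is just a name I made up; it assumes plain term-frequency weighting and that repeated document names across dtms are acceptable):

require(tm)
require(slam)

combine_dtms <- function(dtm_list) {
  # union of all Terms, and all Docs concatenated in order
  all_terms <- unique(unlist(lapply(dtm_list, Terms)))
  all_docs  <- unlist(lapply(dtm_list, Docs))
  # row offsets so document indices from different dtms don't collide
  offsets <- cumsum(c(0L, sapply(dtm_list, nrow)))
  pieces <- Map(function(dtm, off) {
    list(i = dtm$i + off,
         # remap this dtm's column indices onto the unioned Terms
         j = match(Terms(dtm), all_terms)[dtm$j],
         v = dtm$v)
  }, dtm_list, offsets[-length(offsets)])
  out <- simple_triplet_matrix(
    i = unlist(lapply(pieces, `[[`, "i")),
    j = unlist(lapply(pieces, `[[`, "j")),
    v = unlist(lapply(pieces, `[[`, "v")),
    nrow = length(all_docs), ncol = length(all_terms),
    dimnames = list(Docs = all_docs, Terms = all_terms))
  class(out) <- c("DocumentTermMatrix", "simple_triplet_matrix")
  attr(out, "Weighting") <- c("term frequency", "tf")
  out
}

combined <- combine_dtms(list_of_dtms)

Since the result is just the i/j/v structure shown above, this avoids re-unioning the same terms over and over, but I haven't benchmarked it against tm's c() (and the term ordering may differ), so treat it as a starting point rather than a drop-in replacement.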
If I were going to research this more I might have a look at the documentation for slam.
Also, the code for tm:::c.TermDocumentMatrix shows how tm is doing this calculation; I don't know if it's possible to improve on it.
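To read that code yourself:

# print the unexported method tm dispatches to when you call c() on these objects
getAnywhere("c.TermDocumentMatrix")
# or simply:
tm:::c.TermDocumentMatrix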
Upvotes: 3