Ben

Reputation: 42303

Efficiently combine many Document Term Matrices

If I have a list of many document term matrices I can do this to combine them:

# setup for example
require(tm)
data("acq")
data("crude")
acq_dtm <- DocumentTermMatrix(acq)
crude_dtm <- DocumentTermMatrix(crude)
# make list of dtms
list_of_dtms <- list(acq_dtm, crude_dtm)
# convert list of dtms into one big dtm
dtms_combined_into_one <- do.call(tm:::c.DocumentTermMatrix, list_of_dtms)

But this seems very slow and memory intensive, and is a major bottleneck when dealing with a few thousand dtms. How can I combine them faster and using less memory?

Since the dtm is a sparse matrix, I wonder if anyone knows of a method for combining sparse matrices that might be useful here. In my actual use case I am not starting with a corpus but with lists of word counts.

Here's an R-Fiddle, in case that's useful for quick testing: http://www.r-fiddle.org/#/fiddle?id=SojC9ZlA (R-Fiddle seems promising, but I haven't found it very reliable. Is there anything good for this kind of quick prototyping that can install packages?)

Upvotes: 4

Views: 3299

Answers (1)

James King

Reputation: 6365

I do not think there is a trivial way to speed up what you are already doing (maybe there is a clever way). Take a look at str(acq_dtm):

List of 6
 $ i       : int [1:4135] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:4135] 20 33 60 135 187 206 238 256 268 286 ...
 $ v       : num [1:4135] 1 1 2 1 1 2 2 6 1 1 ...
 $ nrow    : int 50
 $ ncol    : int 2103
 $ dimnames:List of 2
  ..$ Docs : chr [1:50] "10" "12" "44" "45" ...
  ..$ Terms: chr [1:2103] "0.5165" "0.523" "0.8" "100" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Each i indexes a document in the Docs component, and each j indexes a term in the Terms component (the first few terms happen to be numbers). v is the frequency of term j in document i. When you do

c(acq_dtm, crude_dtm)

it's more than just stacking up some sparse matrices (that can be done with slam::abind_simple_sparse_array); the Terms of the two matrices have to be unioned, and then the appropriate i and j values have to be recomputed against the combined Docs and Terms.
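To make that recomputation concrete, here is a minimal base-R sketch of doing the union and re-indexing once for a whole list, instead of pairwise. It is not tm's implementation; combine_triplets is a hypothetical helper, and the toy inputs a and b are made-up lists that only mimic the i/j/v/nrow/ncol/dimnames layout shown by str() above. It assumes no term is repeated within a single input.

```r
# Hypothetical helper (not part of tm or slam): combine a list of
# triplet-form matrices in one pass. Each element is assumed to be a list
# with i, j, v, nrow, ncol and dimnames = list(Docs = ..., Terms = ...).
combine_triplets <- function(mats) {
  # Union of all terms, in order of first appearance
  all_terms <- unique(unlist(lapply(mats, function(m) m$dimnames$Terms),
                             use.names = FALSE))
  # Row offset of each matrix = number of documents stacked before it
  nrows <- vapply(mats, function(m) as.integer(m$nrow), integer(1))
  offsets <- cumsum(c(0L, nrows))[seq_along(mats)]
  # Shift document indices by the offset of their matrix
  i <- unlist(Map(function(m, off) m$i + off, mats, offsets),
              use.names = FALSE)
  # Remap term indices into the unioned term list
  j <- unlist(lapply(mats, function(m) match(m$dimnames$Terms[m$j], all_terms)),
              use.names = FALSE)
  v <- unlist(lapply(mats, function(m) m$v), use.names = FALSE)
  docs <- unlist(lapply(mats, function(m) m$dimnames$Docs), use.names = FALSE)
  list(i = i, j = j, v = v,
       nrow = sum(nrows), ncol = length(all_terms),
       dimnames = list(Docs = docs, Terms = all_terms))
}

# Toy inputs (made up for illustration)
a <- list(i = c(1L, 1L, 2L), j = c(1L, 2L, 2L), v = c(2, 1, 3),
          nrow = 2L, ncol = 2L,
          dimnames = list(Docs = c("d1", "d2"), Terms = c("oil", "price")))
b <- list(i = c(1L, 1L), j = c(1L, 2L), v = c(1, 4),
          nrow = 1L, ncol = 2L,
          dimnames = list(Docs = "d3", Terms = c("price", "stock")))

combined <- combine_triplets(list(a, b))
str(combined)
```

If this approach suits your data, the resulting pieces could presumably be handed to slam::simple_triplet_matrix and then tm::as.DocumentTermMatrix with a weighting such as weightTf, but I have not benchmarked that against c() on thousands of dtms.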

If I were going to research this more I might have a look at the documentation for slam.

Also, the code for tm:::c.TermDocumentMatrix shows how tm is doing this calculation; I don't know whether it's possible to improve on it.

Upvotes: 3
