R Text Classification with 800K documents

Question

I have to do some text classification work on a collection of 800K texts. I've been trying to run a practical example I found at the following link:

http://garonfolo.dk/herbert/2015/05/r-text-classification-using-a-k-nearest-neighbour-model/

All went well until I got to the following instruction:

# Transform dtm to matrix to data frame - df is easier to work with
mat.df <- as.data.frame(data.matrix(dtm), stringsAsFactors = FALSE)

After this had run for several hours, I got an error message:

Error: cannot allocate vector of size 583.9 Gb
In addition: Warning messages:
1: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)
2: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)
3: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)
4: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)

Is there a way to overcome this error?

Would it be possible, for example, to split data.matrix(dtm) into chunks, run the job on each chunk, and then merge the results somehow? Or is it better to approach this another way, or in Python?

Thanks

knb · Accepted Answer

Before that as.data.frame() call, enter this line of code:

dtm <- removeSparseTerms(dtm, sparse=0.9)

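The argument sparse=... is a number between 0 and 1. It is the maximum sparsity allowed for a term: any term that is absent from more than that fraction of the documents is removed. So sparse=0.9 above does not mean "keep 90% of the terms"; it means "keep only terms that occur in more than about 10% of the documents". Typically you'll find a suitable value by trial and error. In your case, you may end up with an odd-looking number such as 0.79333. It depends on what you want to do.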

removeSparseTerms() removes terms but keeps the number of documents in the smaller resulting matrix constant. So you'll go from a 12165735 * 800000 element matrix to a 476 * 800000 matrix. Processing this might now be possible on your computer.
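
To see what the sparse threshold does, here is a minimal sketch on a made-up four-document corpus. The documents and variable names are illustrative assumptions, not the asker's data. A term is kept only if it occurs in more than (1 - sparse) of the documents:

```r
library(tm)

# Toy corpus (hypothetical data, for illustration only)
docs <- c("apple banana", "apple cherry", "apple banana date", "apple")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Keep only terms that occur in more than 40% of the documents
dtm_small <- removeSparseTerms(dtm, sparse = 0.6)

dim(dtm)          # documents x terms before pruning
dim(dtm_small)    # same number of documents, fewer terms
Terms(dtm_small)  # the terms that survived
```

On this toy input the rare terms are dropped while all four documents remain. For a sense of scale, a dense 800000 x 476 matrix of 8-byte doubles takes roughly 3 GB (800000 * 476 * 8 bytes), which is why the reduced matrix may fit on a machine with 8 GB of RAM, although the conversion to a data frame can need additional working memory.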

If not, try a clever column-wise split-apply-combine trick with your big matrix.
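
One possible reading of the column-wise split-apply-combine idea is sketched below. It assumes the quantity you need can be computed one block of terms at a time (per-term totals are used here as a stand-in). apply_by_column_chunks is a hypothetical helper written for this example, not a function from tm, and the toy corpus is made up:

```r
library(tm)

# Hypothetical helper: densify one block of columns (terms) at a time,
# apply `fun` to that block, and concatenate the per-block results.
apply_by_column_chunks <- function(dtm, chunk_size, fun) {
  starts <- seq(1, ncol(dtm), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    cols <- s:min(s + chunk_size - 1, ncol(dtm))
    block <- as.matrix(dtm[, cols])  # only this block is held as a dense matrix
    fun(block)
  })
  do.call(c, pieces)
}

# Toy corpus (hypothetical data, for illustration only)
docs <- c("apple banana", "apple cherry", "apple banana date", "apple")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Per-term totals, computed two columns at a time
term_totals <- apply_by_column_chunks(dtm, chunk_size = 2, fun = colSums)
term_totals
```

This only helps for computations that decompose by column. With 800K documents each dense block still has 800K rows, so chunk_size has to be chosen so that 800000 * chunk_size * 8 bytes fits comfortably in memory.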
