dcc310

Reputation: 1076

tm DocumentTermMatrix: processes won't terminate

I am running the following on RStudio server on Ubuntu:

library(tm)
strings = sapply(1:1000, function(x){ paste(sample(c(letters[1:4], " "), 100, replace=TRUE), collapse="")})
corp = VCorpus(VectorSource(strings))
dtm = DocumentTermMatrix(corp)

It took me a few hours to realize that DocumentTermMatrix was causing the problem. Each successive sourcing of my RStudio document (the same happens from the command line) creates 2 more processes; I have 13 R processes now, for example. If I comment out the dtm line, no extra processes are created.

Could this be related to the fairly recent introduction of parallel processing in the tm package? I was using version 0.5 or so and am now on 0.6, but I see the same behavior.

Just to be clear, this code runs fine. The results come back correctly either way, but it's the lingering processes I'm concerned about.

Upvotes: 2

Views: 160

Answers (1)

sergeu

Reputation: 1

Short answer given by Kevin in the comments:

set options(mc.cores=1)

A bit more detail: tm package functions use parallelization, and the documentation lists a number of caveats about it. Sometimes it makes sense to stick to a single process by setting the option mc.cores=1.

Note that different tm functions take this setting in different ways, for example:

 tm_map(corp, stemDocument, mc.cores=1)

vs.

 DocumentTermMatrix(corp, options(mc.cores=1))
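
For completeness, here is a minimal sketch of setting the option for the whole session and restoring it afterwards. It assumes tm's parallel code falls back to the mc.cores option, as Kevin's comment suggests (parallel::mclapply reads it as its default). The toy strings and the variable names old and cores_in_effect are made up for illustration, and the tm part is skipped if the package is not installed.

```r
# options() changes session-wide state and returns the previous values,
# so keep them around to restore later.
old <- options(mc.cores = 1)

# What code relying on getOption("mc.cores") will now see.
cores_in_effect <- getOption("mc.cores")

# Assumes the tm package is installed; skipped otherwise.
if (requireNamespace("tm", quietly = TRUE)) {
  strings <- c("apple banana cherry", "banana cherry date")  # toy data
  corp <- tm::VCorpus(tm::VectorSource(strings))
  dtm <- tm::DocumentTermMatrix(corp)
}

# Put the earlier setting back (removes mc.cores if it was unset before).
options(old)
```

Setting the option on its own line before the call is probably easier to read than passing options(mc.cores=1) as an argument, since options() takes effect as a side effect either way.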

Upvotes: 0
