Reputation: 1076
I am running the following on RStudio server on Ubuntu:
library(tm)
strings = sapply(1:1000, function(x){ paste(sample(c(letters[1:4], " "), 100, replace=TRUE), collapse="")})
corp = VCorpus(VectorSource(strings))
dtm = DocumentTermMatrix(corp)
It took me a few hours to realize that DocumentTermMatrix was causing the problem. Each time I source my RStudio document (the same happens from the command line), 2 more R processes are created and never go away; I have 13 R processes right now. If I comment out the dtm line, no extra processes are created.
Could this be related to the recent-ish introduction of parallel processing in the tm package? I was using version 0.5 or so and am now on 0.6, but I see the same behavior in both.
Just to be clear, this code runs fine. The results come back correctly either way, but it's the lingering processes I'm concerned about.
Upvotes: 2
Views: 160
Reputation: 1
Short answer, given by Kevin in the comments: set
options(mc.cores=1)
A bit more detail: functions in the tm package use parallelization. The documentation lists a number of warnings about using it, so refer to it before relying on it. Sometimes it makes sense to use a single process by setting the option mc.cores=1.
Note that different tm functions use different syntax for that, for example:
tm_map(corp, stemDocument, mc.cores=1)
vs.
DocumentTermMatrix(corp, options(mc.cores=1))
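If you would rather not change mc.cores for the whole session, one option is to set it only around the call and restore the previous value afterwards. The following is a minimal sketch, not something from the tm documentation: with_single_core is a hypothetical helper name, and it assumes your tm version reads the mc.cores option as described above. The tm part only runs if the package is installed.

```r
# Hypothetical helper (not part of tm or parallel): evaluate an expression
# with mc.cores temporarily set to 1, then restore the previous setting.
with_single_core <- function(expr) {
  old <- options(mc.cores = 1)        # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restore them even if expr fails
  force(expr)                         # expr is evaluated here, after the option is set
}

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # Same kind of random test data as in the question, just smaller.
  strings <- sapply(1:10, function(i) {
    paste(sample(c(letters[1:4], " "), 100, replace = TRUE), collapse = "")
  })
  corp <- VCorpus(VectorSource(strings))
  # Assumption: DocumentTermMatrix honors the mc.cores option in your tm version.
  dtm <- with_single_core(DocumentTermMatrix(corp))
}
```

Because R evaluates arguments lazily, the DocumentTermMatrix call is not run until force(expr), so it sees mc.cores=1. After the call returns, mc.cores is back to whatever it was before, including unset.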
Upvotes: 0