hw.fu

Reputation: 13

using package snow's parRapply: argument missing error

I want to find documents whose similarity to other documents is larger than a given value (0.1), processing the documents in blocks.

library(tm)
data("crude")

sample.dtm <- DocumentTermMatrix(
                    crude, control=list(
                        weighting=function(x) weightTfIdf(x, normalize=FALSE),
                        stopwords=TRUE
                    )
                )

step = 5
n = nrow(sample.dtm)
block = n %/% step 
start = (c(1:block)-1)*step+1
end = start+step-1


j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))

ij <- cbind(i,j)

library(skmeans)

getdocs <- function(k){
    ci <- c(start[k[[1]]]:end[k[[1]]])
    cj <- c(start[k[[2]]]:end[k[[2]]])
    combi <- sample.dtm[ci]
    combj <- sample.dtm[cj]

    rownames(combi)<-ci
    rownames(combj)<-cj

    comb<-c(combi,combj)
    sim<-1-skmeans_xdist(comb)

    cat("Block", k[[1]], "with Block", k[[2]], "\n")
    flush.console()

    tri.sim<-upper.tri(sim,diag=F)
    results<-tri.sim & sim>0.1

    docs<-apply(results,1,function(x) length(x[x==TRUE]))
    docnames<-names(docs)[docs>0]

    gc()
    return (docnames)

}

It works well when using apply:

system.time(rmdocs<-apply(ij,1,getdocs))

But it fails when using parRapply:

library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))

Error:

 Error in checkForRemoteErrors(val) : 
      2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
    Timing stopped at: 0.01 0 0.04 

It seems that sample.dtm couldn't be used in parRapply. I'm confused. Can anyone help me? Thanks!

Upvotes: 1

Views: 1894

Answers (1)

Steve Weston

Reputation: 19677

In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there isn't a dimnames method defined for "DocumentTermMatrix" objects, causing rownames<- to fail.

You can load packages on the cluster workers with the clusterEvalQ function:

clusterEvalQ(cl, { library(tm); library(skmeans) })

After doing that, rownames(combi)<-ci will work correctly.

Also, if you want to see the output from cat, you should use the makeCluster outfile argument:

cl <- makeCluster(2, outfile='')
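
Putting the pieces together, the cluster setup might look roughly like the sketch below. It assumes getdocs, sample.dtm, start, end and ij from the question are already defined in the master session, and that tm and skmeans are installed on the machine running the workers. The loaded check is only an illustrative sanity test, not something snow requires.

```r
library(snow)

# outfile = "" sends worker output (e.g. from cat) to the master's console
cl <- makeCluster(2, outfile = "")

# Attach the packages on every worker so the DocumentTermMatrix
# methods (dimnames, subsetting, c) are available there
clusterEvalQ(cl, { library(tm); library(skmeans) })

# Illustrative sanity check: each worker reports whether both packages are attached
loaded <- clusterEvalQ(cl, c("package:tm", "package:skmeans") %in% search())
print(loaded)

# Export the objects from the question (assumed to exist in this session)
clusterExport(cl, list("getdocs", "sample.dtm", "start", "end"))
rmdocs <- parRapply(cl, ij, getdocs)

# Shut the workers down when finished
stopCluster(cl)
```

If any worker reports FALSE in loaded, the package failed to attach there, and the rownames<- error from the question is likely to reappear.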

Upvotes: 1
