I am working on scaling and clustering a matrix of single-nucleus RNA sequencing data (genes x cells) using the R package Seurat. My data is large: 11,500 genes and ~1.5 million cells. Because of this size, the fastest way to scale the matrix would be to parallelize over multiple nodes (each with 40 cores). I am computing on the Niagara cluster and can request as many cores as needed. My problem is that I can't figure out how to parallelize my code effectively. I tried the future package (which Seurat recommends), but that confines my data to one node, which is not enough. I also tried Rmpi, but it seemed to assign the same task to all the spawned workers, namely scaling the whole matrix, which took too long. I have read about future.batchtools, but haven't been able to work out the syntax.
I'll include the code I used for Rmpi and future.batchtools below. I would appreciate any troubleshooting help or alternative strategies to try.
Rmpi:
library(Rmpi)
library(Seurat)

Seuratdata <- readRDS("/path/seuratobject.RDS")
mpi.universe.size()
mpi.spawn.Rslaves(nslaves = 60)
mpi.bcast.cmd(id <- mpi.comm.rank())
mpi.bcast.cmd(np <- mpi.comm.size())
mpi.bcast.cmd(host <- mpi.get.processor.name())

# Scale every gene; note that this same function runs in full on each slave
myfunc <- function(data) {
  all.genes <- rownames(x = data)
  ScaleData(data, features = all.genes)
}

Seuratdata <- mpi.remote.exec(cmd = myfunc, data = Seuratdata)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
mpi.close.Rslaves()
mpi.exit()
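For what it's worth, the behaviour I saw with Rmpi seems expected: mpi.remote.exec() broadcasts the same call to every slave, so all 60 slaves each scaled the full matrix. To actually divide the work, each slave would have to scale only its own slice of genes (scaling is per-gene, so partitioning by gene should be safe) and the master would reassemble the blocks. A hypothetical, untested sketch of that idea:

# Hypothetical sketch: give each slave its own slice of genes (untested)
myfunc <- function(data) {
  all.genes <- rownames(x = data)
  n.slaves <- np - 1                    # np counts the master (rank 0) too
  chunks <- split(all.genes,
                  cut(seq_along(all.genes), n.slaves, labels = FALSE))
  my.genes <- chunks[[id]]              # slave ranks run from 1 to n.slaves
  scaled <- Seurat::ScaleData(data, features = my.genes)
  # Return only this slave's block of the scaled matrix
  Seurat::GetAssayData(scaled, slot = "scale.data")
}
blocks <- mpi.remote.exec(cmd = myfunc, data = Seuratdata, simplify = FALSE)
scaled.matrix <- do.call(rbind, blocks)  # stack the per-slave gene blocks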
future.batchtools:
library(future)
library(future.batchtools)
library(Seurat)
plan(tweak(batchtools_slurm, workers = 80,
           resources = list(ncpus = 1, memory = 10 * 1024^3,
                            walltime = 10 * 60 * 60, partition = "batch"),
           template = "./slurm.tmpl"))
Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)  # was rownames(x = data); 'data' did not exist here
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
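(For reference, with any future-based plan and data this large: future caps exported globals at 500 MiB by default, so a ~1.5-million-cell Seurat object will fail to export unless the limit is raised before calling ScaleData(), e.g.:)

# Raise future's 500 MiB default cap on exported globals
# (the 100 GiB figure is an assumption; set it to what your nodes can hold)
options(future.globals.maxSize = 100 * 1024^3)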
If you've got SSH permission between compute nodes, then you can submit a main job to the scheduler:
$ sbatch --partition=batch --ntasks=100 --time=10:00:00 --mem=10G script.sh
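Here, script.sh is just a thin wrapper that launches R on the main node (a minimal sketch; the module line is an assumption and depends on how R is provided on your cluster):

#!/bin/bash
# Load R (assumption: replace with your site's R module/setup)
module load r
# Launch the analysis on the first node; future spawns workers on the rest
Rscript script.R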
script.sh then runs your R script, e.g. Rscript script.R, which looks like:
library(future)
plan(cluster)
...
This will spin up 100 PSOCK cluster workers on whatever compute nodes Slurm has allocated to the job. This works because plan(cluster) defaults to plan(cluster, workers = availableWorkers()), and availableWorkers() picks up the information in the SLURM_JOB_NODELIST environment variable set by Slurm. You can add:
print(parallelly::availableWorkers())
at the top to log which compute nodes are being used.
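Putting it together, a fuller sketch of script.R might look like this (the globals limit is an assumption, and ScaleData() is one of the Seurat functions that respects the future plan):

library(Seurat)
library(future)

# Log which compute nodes the workers will be launched on
print(parallelly::availableWorkers())

# One PSOCK worker per Slurm task, spread across the allocated nodes
plan(cluster)

# Raise future's 500 MiB default cap on exported globals
# (assumption: adjust to your nodes' memory budget)
options(future.globals.maxSize = 100 * 1024^3)

Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")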
However, there is a limitation to be aware of: plan(cluster) requires SSH access to the hosts in order to spin up the parallel workers on those hosts.