ZappaZ

Reputation: 105

Efficient parallel 3D rotation

I have a large (1000x1000x5000) 3D numpy array on which I need to perform many 3D rotations and then compute an asymmetric distance transform. The distance transform is trivially parallelizable, but I also need a way to perform the rotation itself on a computing cluster, which has limited memory per core (e.g. 2 GB). What's a good strategy to exploit the cluster efficiently? (It has no GPUs or other specialized hardware, for that matter.) And yes, I do need the rotated volume: I cannot simply relabel the coordinates, because the asymmetric distance transform will overwrite the dataset several times. The software I'm using on the cluster is Python 3.4.2 with scipy, numpy, and mpi4py.

Thanks!

Upvotes: 1

Views: 605

Answers (1)

rth

Reputation: 11201

If you want to do matrix operations (e.g. a rotation that you could express as a matrix multiplication) in parallel on a cluster, here is what I would do:

  1. Compile numpy with a multi-threaded BLAS (e.g. OpenBLAS) so that matrix multiplication is multi-threaded within a node. The advantage is that this has been extensively tested and optimized, so you don't need to worry about parallel scaling yourself (see the sketch after this list).
  2. Assume the machine has, say, 32 cores per node (i.e. 32 cores * 2 GB = 64 GB of RAM in total). I would run ~4 MPI tasks per node with 8 threads per MPI task, so each task has 16 GB of RAM available, which removes the low-memory constraint.
  3. Do a domain decomposition of your array among the MPI tasks. For instance, this code (see the _mprotate function) computes rotations with scipy.ndimage using multiprocessing; you could do something similar with mpi4py.
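For step 1, here is one minimal sketch of what "rotation as a matrix multiplication" could look like. It is not the only way to phrase it: the rotation matrix is applied to the flattened coordinate grid (a BLAS-backed matrix product), and the interpolation is then done separately with scipy.ndimage.map_coordinates. The angle, volume size, and interpolation order are placeholders:

    import numpy as np
    from scipy import ndimage

    # Check which BLAS numpy was linked against (look for openblas here).
    np.__config__.show()

    # Hypothetical rotation about the z axis; the angle and the volume
    # size are placeholders (use a small stand-in volume for testing,
    # not the full 1000x1000x5000 array).
    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])

    shape = (100, 100, 100)
    volume = np.random.rand(*shape)

    # Rotating the coordinate grid is a (3, 3) x (3, N) matrix product,
    # which a multi-threaded BLAS parallelizes; np.dot is used because
    # the @ operator needs Python >= 3.5.
    grid = np.indices(shape).reshape(3, -1).astype(np.float64)
    rotated_grid = np.dot(R, grid)

    # The interpolation step is scipy C code, not BLAS. For brevity this
    # rotates about the array origin; a real rotation would offset the
    # coordinates to the volume centre first.
    rotated = ndimage.map_coordinates(volume, rotated_grid,
                                      order=1).reshape(shape)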

The problem, though, is that unless I'm mistaken, scipy.ndimage.interpolation.rotate does not use matrix operations with BLAS: it is a pure C implementation that in the end calls the NI_GeometricTransform function. So, unless you use a different algorithm, the approach above won't work. You would then have to run as many MPI tasks as you have cores and do the domain decomposition among them (see the mpi4py tutorials), for example as sketched below.
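Here is a minimal sketch of such a decomposition, under the assumption that the rotation axis is aligned with the decomposition axis, so that each slab can be rotated independently. The file name, launch command, angle, and array sizes are all placeholders:

    # rotate_mpi.py -- hypothetical script, launched e.g. with:
    #   mpirun -np 64 python rotate_mpi.py
    import numpy as np
    from mpi4py import MPI
    from scipy import ndimage

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Split the 5000 slices along axis 2 as evenly as possible.
    nz = 5000
    counts = [nz // size + (1 if r < nz % size else 0) for r in range(size)]
    start = sum(counts[:rank])

    # In practice each rank would read only its own slab from shared
    # storage; random data (with smaller xy-planes than the real
    # 1000x1000) stands in for it here.
    slab = np.random.rand(200, 200, counts[rank])

    # A rotation about the z axis rotates each xy-plane independently,
    # so slabs along z need no inter-rank communication.
    rotated = ndimage.interpolation.rotate(slab, angle=30.0, axes=(0, 1),
                                           reshape=False, order=1)

    print("rank %d rotated slices %d:%d" % (rank, start,
                                            start + counts[rank]))

With 64 tasks, each slab of the real volume is roughly 1000 x 1000 x 78 doubles (about 0.6 GB), which fits the 2 GB/core budget. Note, however, that a rotation about an arbitrary axis mixes data between slabs and would require communication between tasks, which this sketch does not cover.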

This does not fully answer your question but hope it helps.

Upvotes: 1
