Joshua Gevirtz

Reputation: 401

CUDA IPC Memcpy + MPI fails in Theano, works in pycuda

For learning purposes, I wrote a small C Python module that is supposed to perform an IPC CUDA memcpy to transfer data between processes. For testing, I wrote two equivalent programs: one using theano's CudaNdarray and the other using pycuda. The problem is that even though the test programs are nearly identical, the pycuda version works while the theano version does not. It doesn't crash: it just produces incorrect results.

Below is the relevant function in the C module. Here is what it does: every process has two buffers, a source and a destination. Calling _sillycopy(source, dest, n) copies n elements from each process's source buffer to the neighboring process's dest buffer. So, if I have two processes, 0 and 1, process 0 will end up with process 1's source buffer and process 1 will end up with process 0's source buffer.

Note that to transfer cudaIpcMemHandle_t values between processes, I use MPI (this is a small part of a larger project which uses MPI). _sillycopy is called by another function, "sillycopy", which is exposed to Python via the standard Python C API.
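The wrapper is not the interesting part; a rough sketch of what it does is below (simplified, and the argument handling here is illustrative rather than the exact code): it just unpacks the raw device addresses passed in from Python and forwards them.

static PyObject* sillycopy(PyObject* self, PyObject* args) {
 // The device pointers arrive from Python as plain integers holding
 // raw device addresses (int(a_gpu) in pycuda, .gpudata in theano).
 unsigned long long sourceAddr, destAddr;
 int n;
 if (!PyArg_ParseTuple(args, "KKi", &sourceAddr, &destAddr, &n))
  return NULL;
 // Communicator simplified to MPI_COMM_WORLD for this sketch.
 _sillycopy(reinterpret_cast<float*>(sourceAddr),
            reinterpret_cast<float*>(destAddr),
            n, MPI_COMM_WORLD);
 Py_RETURN_NONE;
}

And here is _sillycopy itself: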

void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
 int localRank;
 int localSize;
 MPI_Comm_rank(comm, &localRank);
 MPI_Comm_size(comm, &localSize);

 //  Figure out which process is to the "left".
 // m() performs a mod and treats negative numbers
 // appropriately 
 int neighbor = m(localRank - 1, localSize); 

 // Create a memory handle for *source and do a
 // wasteful Allgather to distribute to other processes
 // (could just use an MPI_Sendrecv, but irrelevant right now)
 cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
 cudaIpcGetMemHandle(memHandles + localRank, source);
 MPI_Allgather(
  memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
  memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
  comm);

 // Open the neighbor's mem handle so we can do a cudaMemcpy
 float *sourcePtr;
 cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);    

 // Copy!
 cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);
 cudaIpcCloseMemHandle(sourcePtr);
 delete [] memHandles;
}
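One caveat about the code above: none of the CUDA calls are checked. A minimal checking macro along these lines (purely illustrative, not part of the module as posted) would at least surface any API errors while debugging:

#include <cstdio>

// Illustrative error-checking macro: print the CUDA error string,
// file and line if a call fails.
#define CUDA_CHECK(call)                                   \
 do {                                                      \
  cudaError_t err__ = (call);                              \
  if (err__ != cudaSuccess)                                \
   fprintf(stderr, "CUDA error %s at %s:%d\n",             \
           cudaGetErrorString(err__), __FILE__, __LINE__); \
 } while (0)

// Usage, e.g.:
// CUDA_CHECK(cudaIpcGetMemHandle(memHandles + localRank, source));
// CUDA_CHECK(cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault));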

Now here is the pycuda example. For reference, calling int() on a_gpu or b_gpu returns the device memory address of the underlying buffer.

import sillymodule  # sillycopy lives in here
import simplempi as mpi
import pycuda.driver as drv
import numpy as np
import atexit
import time
mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())  
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)

# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)

# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)

# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)

# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)

# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that
print(a_p[0:10])
sillymodule.sillycopy(
    int(a_gpu),
    int(b_gpu),
    shape[0])

# After this, b_p should be filled with 1's
drv.memcpy_dtoh(b_p, b_gpu)
print(b_p[0:10])

And now the theano version of the above code. Rather than using int() to get the buffer's address, a CudaNdarray exposes it via its gpudata attribute.

import os
import simplempi as mpi
mpi.init()

# selects one gpu per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())
import theano.sandbox.cuda as cuda
import time
import numpy as np
import sillymodule

shape = (2 ** 24, )

# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)

# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)


# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]

# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))
sillymodule.sillycopy(
    a_gpu.gpudata,
    b_gpu.gpudata,
    shape[0])

# Should print 1's
print(np.asarray(b_gpu[0:10]))

Again, the pycuda code works perfectly and the theano version runs, but gives the wrong result. To be precise, at the end of the theano code, b_gpu is filled with garbage: neither 1's nor 0's, just random numbers, as though it were copying from the wrong place in memory.

My original theory about why this fails had to do with CUDA contexts: I wondered whether theano was doing something with them so that the cuda calls made in sillycopy ran under a different CUDA context than the one used to create the gpu arrays. I don't think this is the case, because:

  1. I spent a lot of time digging through theano's code and saw nothing unusual being done with contexts.
  2. I would expect such a problem to cause a hard crash rather than a silently wrong result, and no crash occurs. (A small diagnostic sketch for checking the context directly is below.)
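To probe the context theory more directly, something like the following could be dropped into _sillycopy (a sketch only; it is not part of the module as posted, and it mixes the runtime and driver APIs purely for diagnostic output):

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

// Diagnostic sketch: report which device the incoming pointer belongs
// to and which driver context is current when _sillycopy runs.
static void _inspectPointer(const void *ptr) {
 cudaPointerAttributes attr;
 cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
 if (err != cudaSuccess) {
  fprintf(stderr, "cudaPointerGetAttributes: %s\n", cudaGetErrorString(err));
  return;
 }
 CUcontext current = NULL;
 cuCtxGetCurrent(&current);
 fprintf(stderr, "pointer device: %d, current driver context: %p\n",
         attr.device, (void *)current);
}

Calling _inspectPointer(source) in both the pycuda and theano runs and comparing the reported device and context would confirm or rule out a mismatch.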

A secondary thought is whether this has to do with the fact that theano spawns several threads, even when using a cuda backend, which can be verified by running "ps huH p ". I don't know how the threads might affect anything, but I have run out of obvious things to consider.

Any thoughts on this would be greatly appreciated!

For reference: the processes are launched in the normal OpenMPI way:

mpirun --np 2 python test_pycuda.py

Upvotes: 2

Views: 494

Answers (0)
