Reputation: 23
I have a 3D matrix where each x-y plane represents an image and the z axis represents the image layers.
The issue is that when I try to extract the first layer (or any other layer) using idz, I do not get the expected results. It looks like the array, once in CUDA, is indexed differently in x, y, and z than what I expect from pycuda. I can see this in the result array below.
The following is a step-by-step walkthrough of this mini example (I used generic integer values to represent my images, to save uploading the images and the entire code)!
Here I import libraries and define image size and layers...
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
from pycuda.gpuarray import to_gpu
row = 10
column = 10
depth = 5
Then I define my input 3D array and my output 2D array...
#--==== Input 3D Array ====---
arrayA = numpy.full((row, column, depth), 0)
#populate each layer with fixed values
for i in range(depth):
    arrayA[:,:,i] = i + 1
arrayA = arrayA.astype(numpy.uint16)
arrayA_gpu = cuda.mem_alloc(arrayA.nbytes)
cuda.memcpy_htod(arrayA_gpu, arrayA)
arrayA_Answer = numpy.empty_like(arrayA)
#--==== Output 2D array container ====---
arrayB = numpy.zeros([row, column], dtype = numpy.uint16)
arrayB_gpu = cuda.mem_alloc(arrayB.nbytes)
cuda.memcpy_htod(arrayB_gpu, arrayB)
arrayB_Answer = numpy.empty_like(arrayB)
Next I define the CUDA kernel and get its function handle in pycuda...
mod = SourceModule("""
__global__ void getLayer(int *arrayA, int *arrayB)
{
    int idx = threadIdx.x + (blockIdx.x * blockDim.x); // x coordinate (numpy axis 2)
    int idy = threadIdx.y + (blockIdx.y * blockDim.y); // y coordinate (numpy axis 1)
    int idz = 0; // the first layer; this can be set in the range 0-4
    int x_width = (blockDim.x * gridDim.x);
    int y_width = (blockDim.y * gridDim.y);
    arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * idz];
}
""")
func = mod.get_function("getLayer")
func(arrayA_gpu, arrayB_gpu, block=(row, column, 1), grid=(1,1))
Using standard pycuda commands (the memcpy_dtoh copies below), I extract the results, and they are not what I expected.
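#--==== Copy device results back to host ====---
cuda.memcpy_dtoh(arrayA_Answer, arrayA_gpu)
cuda.memcpy_dtoh(arrayB_Answer, arrayB_gpu)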
arrayA[:,:,0] = 10x10 matrix populated with 1's (good)
print(arrayA_Answer[:,:,0])
[[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]]
arrayB[:,:] = 10x10 matrix populated with the following (bad); I expected it to be equal to arrayA[:,:,0]...
print(arrayB_Answer)
[[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]]
Upvotes: 0
Views: 430
Reputation: 151849
As discussed here, the numpy 3D storage order pattern is that the "z" (i.e. 3rd) index is the rapidly varying index as you progress linearly through memory. Your code assumes that the first index ("x") is the rapidly varying one. That is also exactly what your output shows: walking linearly through arrayA's memory cycles through the five layer values, which is why arrayB comes out as repeating rows of 1 2 3 4 5.
Since your kernel is already organized for efficient ("coalesced") load/store behavior, you could address this by reordering the storage of your images/layers/slices in numpy. Here is a worked example:
$ cat t10.py
from __future__ import print_function
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
from pycuda.gpuarray import to_gpu
row = 5
column = 10
depth = 10
#--==== Input 3D Array ====---
arrayA = numpy.full((row, column, depth), 0)
my_slice=numpy.int32(3) # choose the layer
#populate each layer with fixed values
for i in range(row):
    arrayA[i,:,:] = i + 1
arrayA = arrayA.astype(numpy.int32)
arrayA_gpu = cuda.mem_alloc(arrayA.nbytes)
cuda.memcpy_htod(arrayA_gpu, arrayA)
arrayA_Answer = numpy.empty_like(arrayA)
#--==== Output 2D array container ====---
arrayB = numpy.zeros([column, depth], dtype = numpy.int32)
arrayB_gpu = cuda.mem_alloc(arrayB.nbytes)
cuda.memcpy_htod(arrayB_gpu, arrayB)
arrayB_Answer = numpy.empty_like(arrayB)
mod = SourceModule("""
__global__ void getLayer(int *arrayA, int *arrayB, int slice)
{
    int idx = threadIdx.x + (blockIdx.x * blockDim.x); // x coordinate (numpy axis 2)
    int idy = threadIdx.y + (blockIdx.y * blockDim.y); // y coordinate (numpy axis 1)
    int idz = slice; //The "layer"
    int x_width = (blockDim.x * gridDim.x);
    int y_width = (blockDim.y * gridDim.y);
    arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * idz];
}
""")
func = mod.get_function("getLayer")
func(arrayA_gpu, arrayB_gpu, my_slice, block=(depth, column, 1), grid=(1,1))
cuda.memcpy_dtoh(arrayB_Answer,arrayB_gpu)
print(arrayA[my_slice,:,:])
print(arrayB_Answer[:,:])
$ python t10.py
[[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]]
[[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]]
$
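If you prefer to keep populating the array as (row, column, depth) the way the question does, an equivalent fix is to reorder the memory on the host before the copy. A sketch, assuming arrayA is the question's (10, 10, 5) array with layers on axis 2:

arrayA = numpy.ascontiguousarray(arrayA.transpose(2, 0, 1)).astype(numpy.int32)
# shape is now (depth, row, column) and each layer arrayA[i,:,:] is contiguous,
# so the kernel can select layer i with idz = i and block=(column, row, 1)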
Note that I have also changed your use of uint16 to int32, to match the kernel type int.
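Alternatively, if your data really needs to stay uint16 on the host, you could leave the numpy arrays as numpy.uint16 and declare the kernel parameters as unsigned short (CUDA's 16-bit unsigned type). A minimal sketch of that variant:

mod = SourceModule("""
__global__ void getLayer(unsigned short *arrayA, unsigned short *arrayB, int slice)
{
    // identical indexing to the kernel above; only the element type changes
    int idx = threadIdx.x + (blockIdx.x * blockDim.x);
    int idy = threadIdx.y + (blockIdx.y * blockDim.y);
    int x_width = (blockDim.x * gridDim.x);
    int y_width = (blockDim.y * gridDim.y);
    arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * slice];
}
""")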
Upvotes: 1