Reputation: 89
I am trying to parallelize bitonic sort with PyCUDA. For this I use SourceModule and the C code of the parallel bitonic sort. To manage the memory copies I use pycuda.driver.InOut, which simplifies some of the transfers.
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda import gpuarray
import numpy as np
from time import time
ker = SourceModule(
    """
    __device__ void swap(int &a, int &b){
        int tmp = a;
        a = b;
        b = tmp;
    }

    __global__ void bitonicSort(int *values, int N){
        extern __shared__ int shared[];
        int tid = threadIdx.x + blockDim.x * blockIdx.x;
        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();
        // Parallel bitonic sort.
        for (int k = 2; k <= N; k *= 2){
            // Bitonic merge:
            for (int j = k / 2; j > 0; j /= 2){
                int ixj = tid ^ j;
                if (ixj > tid){
                    if ((tid & k) == 0){
                        // Sort ascending
                        if (shared[tid] > shared[ixj]){
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                    else{
                        // Sort descending
                        if (shared[tid] < shared[ixj]){
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                }
                __syncthreads();
            }
        }
        values[tid] = shared[tid];
    }
    """
)
N = 8  # length of A
A = np.int32(np.random.randint(1, 20, N)) #random numbers in A
BLOCK_SIZE = 256
NUM_BLOCKS = (N + BLOCK_SIZE-1)//BLOCK_SIZE
bitonicSort = ker.get_function("bitonicSort")
t1 = time()
bitonicSort(drv.InOut(A), np.int32(N), block=(BLOCK_SIZE,1,1), grid=(NUM_BLOCKS,1), shared=4*N)
t2 = time()
print("Execution Time {0}".format(t2 - t1))
print(A)
Since the kernel uses extern __shared__, in PyCUDA I pass the shared parameter with the corresponding size of 4*N bytes. I also tried declaring __shared__ int shared[N] in the kernel, but that doesn't work either (check here: Getting started with shared memory on PyCUDA).
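(A likely reason __shared__ int shared[N] fails is that N is a kernel argument only known at run time, while a statically sized shared array needs a compile-time constant. A minimal sketch of the usual workaround is to substitute the size into the source string before compiling; the kernel name copyThroughShared here is hypothetical:)

from pycuda.compiler import SourceModule
N = 8
# The size of a static __shared__ array must be a compile-time constant,
# so bake N into the CUDA source string before compilation.
mod = SourceModule("""
__global__ void copyThroughShared(int *values){
    __shared__ int shared[%(N)d];   // size fixed when the module is built
    int tid = threadIdx.x;
    shared[tid] = values[tid];
    __syncthreads();
    values[tid] = shared[tid];
}
""" % {"N": N})

With a static size like this, no shared= argument is needed at launch.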
Running in Google Colab I get the following error:
/usr/local/lib/python3.6/dist-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
292
293 from pycuda.driver import module_from_buffer
--> 294 self.module = module_from_buffer(cubin)
295
296 self._bind_module()
LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered
Does anyone know what could be generating this error?
Upvotes: 0
Views: 1530
Reputation: 151849
Your device code isn't accounting for the sizes of your arrays correctly.
You are launching 256 threads in a single block. That means you will have 256 threads, with tid numbered 0..255, trying to execute each line of code. For example, in this case:
shared[tid] = values[tid];
you will have, among others, one thread trying to do shared[255] = values[255];. Neither your shared array nor your values array is that large. That is the reason for the illegal memory access error.
The simplest solution for this kind of trivial problem is to make your array sizes match your block size.
BLOCK_SIZE = N
According to my testing, that change clears up any errors and results in a properly sorted array.
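For reference, here is a minimal sketch of the corrected host-side code under that change (the kernel itself is unchanged; with N = 8, NUM_BLOCKS works out to 1):

N = 8                      # must be a power of two for this bitonic kernel
A = np.int32(np.random.randint(1, 20, N))
BLOCK_SIZE = N             # match the block size to the array size
NUM_BLOCKS = (N + BLOCK_SIZE - 1) // BLOCK_SIZE  # = 1 here
bitonicSort = ker.get_function("bitonicSort")
bitonicSort(drv.InOut(A), np.int32(N),
            block=(BLOCK_SIZE, 1, 1), grid=(NUM_BLOCKS, 1), shared=4*N)
print(A)  # now sorted ascending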
It won't work for N greater than 1024, or for multi-block usage, but your code would have to be modified for a multi-block sort anyway.
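If you did want to keep a block larger than N (still a single block), one alternative, which is my own sketch rather than part of the fix above, is to bounds-check tid so the out-of-range threads never touch shared or values while still reaching every barrier. It reuses the same swap() device function as in the question:

__global__ void bitonicSort(int *values, int N){
    extern __shared__ int shared[];
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < N)                     // extra threads must not read or write
        shared[tid] = values[tid];
    __syncthreads();
    for (int k = 2; k <= N; k *= 2){
        for (int j = k / 2; j > 0; j /= 2){
            int ixj = tid ^ j;
            // Only in-range threads compare/swap; ixj < N holds because
            // N is a power of two and both tid and j are below N.
            if (tid < N && ixj > tid){
                if ((tid & k) == 0){
                    if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);
                } else {
                    if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();         // all BLOCK_SIZE threads reach the barrier
        }
    }
    if (tid < N)
        values[tid] = shared[tid];
}

This only avoids the out-of-bounds accesses; it is still a single-block sort, not a multi-block one.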
If you still have trouble after making that change, I suggest restarting your python session or your colab session; a prior illegal memory access can leave the CUDA context in a corrupted state, which is likely why the error surfaced during module loading rather than at the kernel launch.
Upvotes: 1