suribe06
suribe06

Reputation: 89

PyCUDA LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered

I am trying to parallelize the bitonic sort with pycuda. For this I use SourceModule and the C code of the parallel bitonic sort. For the memory copies management I use InOut of the pycuda.driver that simplify some of the memory transfers

import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda import gpuarray
import numpy as np
from time import time

ker = SourceModule(
    """
    __device__ void swap(int & a, int & b){
        int tmp = a;
        a = b;
        b = tmp;
    }

    __global__ void bitonicSort(int * values, int N){
        extern __shared__ int shared[];
        int tid = threadIdx.x + blockDim.x * blockIdx.x;
        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();
        // Parallel bitonic sort.
        for (int k = 2; k <= N; k *= 2){
            // Bitonic merge:
            for (int j = k / 2; j>0; j /= 2){
                int ixj = tid ^ j;
                if (ixj > tid){
                    if ((tid & k) == 0){
                        //Sort ascending
                        if (shared[tid] > shared[ixj]){
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                    else{
                        //Sort descending
                        if (shared[tid] < shared[ixj]){
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                }
                __syncthreads();
            }
        }
        values[tid] = shared[tid];
    }
    """
)

N = 8 #lenght of A
A = np.int32(np.random.randint(1, 20, N)) #random numbers in A
BLOCK_SIZE = 256
NUM_BLOCKS = (N + BLOCK_SIZE-1)//BLOCK_SIZE
bitonicSort = ker.get_function("bitonicSort")
t1 = time()
bitonicSort(drv.InOut(A), np.int32(N), block=(BLOCK_SIZE,1,1), grid=(NUM_BLOCKS,1), shared=4*N)
t2 = time()
print("Execution Time {0}".format(t2 - t1))
print(A)

As in the kernel I use extern __shared__, in pycuda I use the shared parameter with the respective 4*N. Also try using __shared__ int shared[N] in the kernel but it doesn't work either (check here: Getting started with shared memory on PyCUDA)

Running in Google Collab I get the following error:

/usr/local/lib/python3.6/dist-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
    292 
    293         from pycuda.driver import module_from_buffer
--> 294         self.module = module_from_buffer(cubin)
    295 
    296         self._bind_module()

LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered

Does anyone know what could be generating this error?

Upvotes: 0

Views: 1530

Answers (1)

Robert Crovella
Robert Crovella

Reputation: 151849

Your device code isn't accounting for the sizes of your arrays correctly.

You are launching 256 threads in a single block. That means that you will have 256 threads, with tid numbered 0..255, trying to execute each line of code. For example, in this case:

shared[tid] = values[tid]; 

You will have, for example, one thread trying to do shared[255] = values[255];

Neither your shared nor values array are that large. That is the reason for the illegal memory access error.

The simplest solution for this kind of trivial problem is to make your array sizes match your block size.

BLOCK_SIZE = N

According to my testing, that change clears up any errors and results in a properly sorted array.

It won't work for N greater than 1024, or multi-block usage, but your code would have to be modified for a multi-block sort, anyway.

If you still have trouble after making that change, I suggest restarting your python session or your colab session.

Upvotes: 1

Related Questions