Ander Biguri

Reputation: 35525

Error in kernel when using big arrays

While using a simple kernel to memset a CUDA array, I get an "invalid argument" error for big arrays (more than about pow(2,25) elements).

I am running on a Tesla K40. I should have far more memory than needed to allocate the array, and the device should support the number of blocks I am launching. However, the following code exits with an error:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h> 
#include <math.h>

#define MAXTHREADS 1024
//http://stackoverflow.com/a/16283216/1485872
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
                exit(1);} \
        } while (0)

__global__ void mymemset(float* image, const float val, size_t N)
{
    //http://stackoverflow.com/a/35133396/1485872
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        image[tid] = val;
        tid += gridDim.x * blockDim.x;
    }
}


int main()
{

    size_t total_pixels = pow(2, 26) ;
    float* d_image;
    cudaMalloc(&d_image, total_pixels*sizeof(float));
    cudaCheckErrors("Malloc");

    dim3 bsz = dim3(MAXTHREADS);
    dim3 gsz = dim3(total_pixels / bsz.x + ((total_pixels % bsz.x > 0) ? 1 : 0));
    mymemset<<<gsz, bsz>>>(d_image, 1.0f, total_pixels);
    cudaCheckErrors("mymemset"); //<- error!
    cudaDeviceReset();
    return 0;
}

The code works fine for total_pixels up to (and a bit beyond) pow(2,25), but fails for pow(2,26).

Coincidentally, this is the point where the grid size gsz reaches 65536 blocks, which seems to exceed an upper limit on some GPUs. On the Tesla K40, however, the limit is supposed to be 2147483647 for the x dimension and 65535 for y and z (which I am not using). Any insight into the origin of this error?

Compiler flags from VS2013: Properties->CUDA C/C++/command line

# Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1\"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin"     -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -o Debug\%(Filename)%(Extension).obj "%(FullPath)"

# Runtime API (NVCC Compilation Type is hybrid object or .c file)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1\"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin"     -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g    -Xcompiler "/EHsc  /nologo  /Zi   " -o Debug\%(Filename)%(Extension).obj "%(FullPath)"

Upvotes: 0

Views: 103

Answers (1)

talonmies

Reputation: 72349

You are compiling for the default architecture (sm_20), which limits the grid to 65535 blocks in each dimension. You must build for sm_35 to be able to launch up to 2147483647 blocks in a 1D grid.
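
The arithmetic behind that limit can be checked on the host. The following is only a sketch in plain C++ (it also compiles as host code in a .cu file); the helper names blocksNeeded and fitsInGrid and the two constants are hypothetical, introduced here for illustration, with the limits taken from the figures quoted above.

```cpp
#include <cstddef>

// Grid x-dimension limit when building for sm_20, as stated in the answer.
static const std::size_t kMaxGridX_sm20 = 65535;
// Grid x-dimension limit when building for sm_35 (2^31 - 1).
static const std::size_t kMaxGridX_sm35 = 2147483647;

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= n (ceiling division, as in the question).
inline std::size_t blocksNeeded(std::size_t n, std::size_t threadsPerBlock)
{
    return n / threadsPerBlock + ((n % threadsPerBlock > 0) ? 1 : 0);
}

// Hypothetical helper: true when a 1D grid of `blocks` blocks
// stays within the given x-dimension limit.
inline bool fitsInGrid(std::size_t blocks, std::size_t maxGridX)
{
    return blocks <= maxGridX;
}
```

With 1024 threads per block, blocksNeeded gives 32768 blocks for pow(2,25) elements, which fits under the sm_20 limit, and 65536 blocks for pow(2,26) elements, which is one more than 65535 and so only fits under the sm_35 limit.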

You should also note that the kernel you are using (which I wrote) could be run with far fewer blocks than (n/blocksize) and still work correctly, and it would be more efficient to do so.
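
As a rough sketch of that idea, the launch below caps the grid at a small multiple of the number of multiprocessors and lets the grid-stride loop cover the remaining elements. The factor of 8 blocks per multiprocessor, the 256-thread block size, and the check helper are assumptions chosen for illustration, not tuned or recommended values.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Grid-stride kernel with the same structure as the one in the question:
// each thread handles elements tid, tid + stride, tid + 2*stride, ...
__global__ void mymemset(float* image, const float val, size_t N)
{
    size_t tid = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; tid < N; tid += stride) {
        image[tid] = val;
    }
}

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

int main()
{
    const size_t n = (size_t)1 << 26;  // 2^26 elements, as in the question
    float* d_image = NULL;
    check(cudaMalloc(&d_image, n * sizeof(float)), "cudaMalloc");

    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties");

    const int threads = 256;  // assumed block size
    // Blocks needed for one element per thread (ceiling division).
    const size_t needed = (n + threads - 1) / threads;
    // Assumed cap: 8 blocks per multiprocessor, far below 65535.
    const size_t cap = (size_t)prop.multiProcessorCount * 8;
    const int blocks = (int)(needed < cap ? needed : cap);

    mymemset<<<blocks, threads>>>(d_image, 1.0f, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    // Spot-check the first and last elements on the host.
    float first = 0.0f, last = 0.0f;
    check(cudaMemcpy(&first, d_image, sizeof(float),
                     cudaMemcpyDeviceToHost), "cudaMemcpy first");
    check(cudaMemcpy(&last, d_image + (n - 1), sizeof(float),
                     cudaMemcpyDeviceToHost), "cudaMemcpy last");
    printf("blocks=%d first=%f last=%f\n", blocks, first, last);

    check(cudaFree(d_image), "cudaFree");
    return 0;
}
```

Because the grid size no longer depends on the array length, a launch like this stays within the 65535-block limit regardless of the target architecture.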

Upvotes: 2
