Reputation: 35525
While using a simple kernel to memset a CUDA array, I get an invalid argument error for big arrays (roughly above pow(2,25)).
I am running on a Tesla K40. I should have more than enough memory to allocate the array, and also enough capacity to launch the number of blocks I am requesting, yet the following code exits with an error:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MAXTHREADS 1024
//http://stackoverflow.com/a/16283216/1485872
#define cudaCheckErrors(msg) \
do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
    } \
} while (0)
__global__ void mymemset(float* image, const float val, size_t N)
{
    //http://stackoverflow.com/a/35133396/1485872
    size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        image[tid] = val;
        tid += gridDim.x * blockDim.x;
    }
}
int main()
{
    size_t total_pixels = pow(2, 26);
    float* d_image;
    cudaMalloc(&d_image, total_pixels * sizeof(float));
    cudaCheckErrors("Malloc");
    dim3 bsz = dim3(MAXTHREADS);
    dim3 gsz = dim3(total_pixels / bsz.x + ((total_pixels % bsz.x > 0) ? 1 : 0));
    mymemset<<<gsz, bsz>>>(d_image, 1.0f, total_pixels);
    cudaCheckErrors("mymemset"); //<- error!
    cudaDeviceReset();
}
The code works fine up to (and slightly beyond) pow(2,25) in total_pixels, but fails for pow(2,26). Coincidentally, this is the point where the grid size gsz reaches 65536, which seems to be an upper limit on some GPUs; however, on the Tesla K40 it is supposed to be 2147483647 for the x dimension and 65535 for y and z (which I am not using). Any insight into the origin of this error?
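For reference, these limits can be queried at runtime with cudaGetDeviceProperties; a minimal standalone sketch (device 0 assumed):
#include "cuda_runtime.h"
#include <stdio.h>
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0 assumed
    printf("%s: maxGridSize = %d x %d x %d, maxThreadsPerBlock = %d\n",
        prop.name,
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2],
        prop.maxThreadsPerBlock);
    return 0;
}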
Compiler flags from VS2013: Properties->CUDA C/C++/command line
# Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1\"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -o Debug\%(Filename)%(Extension).obj "%(FullPath)"
# Runtime API (NVCC Compilation Type is hybrid object or .c file)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1\"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -Xcompiler "/EHsc /nologo /Zi " -o Debug\%(Filename)%(Extension).obj "%(FullPath)"
Upvotes: 0
Views: 103
Reputation: 72349
You are compiling for the default architecture (sm_20), which has a grid size limit of 65535 in each dimension. You must build for sm_35 to be able to launch 2147483647 blocks in a 1D grid.
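For example, one way is to add the architecture flag to the nvcc command line shown in the question (in Visual Studio this is normally set via the project's Code Generation property); the extra flag would be roughly:
# target the Tesla K40 (compute capability 3.5) instead of the default sm_20
-gencode arch=compute_35,code=sm_35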
You should also note that the kernel you are using (which I wrote) can be run with many fewer blocks than (n/blocksize) and still work correctly, because its while loop strides over the whole array; it would also be more efficient to do so.
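A minimal sketch of such a launch, reusing the question's kernel; the block count here (8 per SM) is only an illustrative choice, not a tuned value:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);              // device 0 assumed
size_t nblocks = 8 * prop.multiProcessorCount;  // enough resident blocks to fill the GPU
size_t needed = (total_pixels + MAXTHREADS - 1) / MAXTHREADS;
if (nblocks > needed) nblocks = needed;         // never launch more blocks than the data needs
mymemset<<<nblocks, MAXTHREADS>>>(d_image, 1.0f, total_pixels);
cudaCheckErrors("mymemset");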
Upvotes: 2