soriak
soriak

Reputation: 672

cuda "invalid argument" error on second kernel

I got a problem with kernel launches. I had a program using one big kernel. Now I needed to split it into two due to synchronization issues. The first kernel does some init stuff and gets passed a subset of the arguments passed to the second kernel. Running only the first kernel works fine. Running only the second kernels fails while executing it, due to missing initialization but the kernel itself is started. Running both in a row lets the second kernel fail with an "invalid argument" error. I will provide code if necessary but can't figure out right now how it might help. Thanks in advance.

EDIT: here the requested launch code:

void DeviceManager::integrate(){
  assert(hostArgs->neighborhoodsSize > 0);
  size_t maxBlockSize;
  size_t blocks;
  size_t threadsPerBlock;
  // init patch kernel
  maxBlockSize = 64;
  blocks = (hostArgs->patchesSize /maxBlockSize);
  if(0 != hostArgs->patchesSize % maxBlockSize){
    blocks++;
  }
  threadsPerBlock = maxBlockSize;
  std::cout << "blocks: " << blocks << ", threadsPerBlock: " << threadsPerBlock << std::endl;
  initPatchKernel<CUDA_MAX_SPACE_DIMENSION><<<blocks,threadsPerBlock>>>(devicePatches, hostArgs->patchesSize);
  cudaDeviceSynchronize();

  //calc kernel
  maxBlockSize = 64;
  blocks = (hostArgs->neighborhoodsSize /maxBlockSize);
  if(0 != hostArgs->neighborhoodsSize % maxBlockSize){
    blocks++;
  }
  threadsPerBlock = maxBlockSize;
  size_t maxHeapSize = hostArgs->patchesSize * (sizeof(LegendreSpace) + sizeof(LinearSpline)) + hostArgs->neighborhoodsSize * (sizeof(ReactionDiffusionCCLinearForm) + sizeof(ReactionDiffusionCCBiLinearForm));
  std::cout << "maxHeapSize: " << maxHeapSize << std::endl;
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, maxHeapSize);
  std::cout << "blocks: " << blocks << ", threadsPerBlock: " << threadsPerBlock << std::endl;
  integrateKernel<CUDA_MAX_SPACE_DIMENSION><<<blocks,threadsPerBlock>>>(deviceNeighborhoods, hostArgs->neighborhoodsSize, devicePatches, hostArgs->patchesSize, hostArgs->biLinearForms, hostArgs->linearForms, deviceRes);
  cudaDeviceSynchronize();
}

The memory transfers and allocation should not be a problem, since it worked when using only one kernel.

EDIT 2: I check for errors after each kernel call when building in debug mode via a wrapper function. So after each kernel call the following is executed:

cudaError_t cuda_result_code = cudaGetLastError();                        
if (cuda_result_code!=cudaSuccess) {                                      
   fprintf("message: %s\n",cudaGetErrorString(cuda_result_code));
}

Sorry for not mentioning this, the wrapper is not by me so sorry for not pasting the trick. The output right before the failure is the following:

blocks: 1, threadsPerBlock: 64
maxHeapSize: 4480
blocks: 1, threadsPerBlock: 64
message: invalid argument

Upvotes: 1

Views: 7699

Answers (1)

Greg Smith
Greg Smith

Reputation: 11509

cudaDeviceSetLimit

cudaLimitMallocHeapSize controls the size in bytes of the heap used by the malloc() and free() device system calls. Setting cudaLimitMallocHeapSize must be performed before launching any kernel that uses the malloc() or free() device system calls, otherwise cudaErrorInvalidValue will be returned. This limit is only applicable to devices of compute capability 2.0 and higher. Attempting to set this limit on devices of compute capability less than 2.0 will result in the error cudaErrorUnsupportedLimit being returned.

Upvotes: 3

Related Questions