Reputation: 1021
I have a dynamic memory allocation inside my kernel:
float *MyLongArray1 = new float[array_size];
float *MyLongArray2 = new float[array_size];
where array_size
is passed in as a kernel argument. array_size
is on the order of 100000, so quite large.
Memory allocation seems to be working fine. Then I try to do something with the both arrays
for(int i=0; i<array_size; i++)
{
    for(int j=0; j<array_size; j++)
    {
        // do some calculations
    }
    MyLongArray1[i]=calculation_result1;
    MyLongArray2[i]=calculation_result2;
}
The code I've written works fine with 1 core and up to 15 cores. However, when I use 16 cores I get GPUassert: unspecified launch failure
, even though cuda-memcheck
still reports 0 errors.
I have run some experiments. When I comment out the MyLongArray2[i]=calculation_result2;
line, the code works again. When I halve array_size
, I can double the number of cores. It looks like dynamic allocation takes much more memory than expected? I am on a Fermi card with 3 GB of memory, so my arrays should fit into global memory fine.
What are the possible solutions in this case? Should I avoid dynamic memory allocation in CUDA applications?
Upvotes: 1
Views: 437
Reputation: 4291
In all likelihood, you're exceeding the size of the device heap. In-kernel new and malloc draw from a separate device heap that defaults to only 8 MB, regardless of how much global memory the card has, which is why a 3 GB card can still fail here. You can raise the limit with a CUDA API call:
cudaDeviceSetLimit(cudaLimitMallocHeapSize, n*100000*sizeof(float));
where n accounts for the total allocation across all threads. Make sure you do this before any kernel call. That said, you should strongly consider calling cudaMalloc once from the host to allocate a single large array instead of allocating inside the kernel.
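A minimal sketch of that alternative, assuming each thread works on its own slice of two buffers allocated once from the host (the names my_kernel, buf1, buf2, and num_threads are illustrative, not from your code):

```cuda
#include <cuda_runtime.h>

// Each thread indexes into a contiguous slice of host-allocated buffers,
// so no in-kernel new/malloc (and no device-heap limit) is involved.
__global__ void my_kernel(float *buf1, float *buf2, int array_size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *MyLongArray1 = buf1 + (size_t)tid * array_size;
    float *MyLongArray2 = buf2 + (size_t)tid * array_size;

    for (int i = 0; i < array_size; i++) {
        // stand-ins for your per-element calculations
        MyLongArray1[i] = 1.0f;  // calculation_result1
        MyLongArray2[i] = 2.0f;  // calculation_result2
    }
}

int main()
{
    const int array_size = 100000;
    const int num_threads = 16;
    const size_t bytes = (size_t)num_threads * array_size * sizeof(float);

    float *buf1, *buf2;
    cudaMalloc(&buf1, bytes);  // one large allocation per array,
    cudaMalloc(&buf2, bytes);  // shared by all threads

    my_kernel<<<1, num_threads>>>(buf1, buf2, array_size);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaFree(buf2);
    return 0;
}
```

Besides sidestepping the heap limit, a single cudaMalloc is typically faster than many per-thread device-side allocations, and the memory is visible to the host for copying results back.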
Upvotes: 3