Reputation: 1777
I have the following snippet of a kernel:
__global__ void plain(int* geneVec, float* probs, int* nComponents, float* randomNumbers, int* nGenes)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);
    // tmp, start and prob are set up earlier in the kernel (elided)
    float* currentProbs = (float*)malloc(sizeof(float) * tmp);
    // ...
    // build this thread's cumulative probability array
    currentProbs[0] = probs[start];
    for (int k = 1; k < nComponents[0]; k++)
    {
        currentProbs[k] = currentProbs[k - 1] + prob;
    }
    // ...
    free(currentProbs);
}
When currentProbs is a static array (even of the same size) the kernel is very fast, but when it is dynamically allocated (as above) performance is awful.
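For comparison, the fast static version looks roughly like this (MAX_COMPONENTS is a hypothetical compile-time bound standing in for the fixed size I actually use):

#define MAX_COMPONENTS 32   // hypothetical upper bound on nComponents[0]

__global__ void plain_static(int* geneVec, float* probs, int* nComponents,
                             float* randomNumbers, int* nGenes)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);
    float currentProbs[MAX_COMPONENTS];   // fixed-size per-thread array, no malloc
    // ... same body as above, without the final free() ...
}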
This question said I could do this inside a kernel: CUDA allocate memory in __device__ function
Here is a related question: Efficiency of Malloc function in CUDA
I was wondering whether any methods other than the one proposed in the paper have solved this. It seems ridiculous that one cannot malloc/free inside a kernel without this sort of penalty.
Upvotes: 8
Views: 26927
Reputation: 15724
I think the reason introducing malloc() slows your code down is that it allocates memory in global memory. When you use a fixed-size array, the compiler is likely to put it in the register file, which is much faster.
Having to do a malloc inside your kernel may mean that you're trying to do too much work with a single kernel. If each thread allocates a different amount of memory, then each thread runs a different number of times in the for loop, and you get lots of warp divergence.
If each thread in a warp runs the loop the same number of times, just allocate the memory up front. Even if threads run it a different number of times, you can use a constant (maximum) size, as in the sketch below. But better still, look at how you can refactor your code to remove that loop from your kernel entirely.
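Here is a minimal sketch of the up-front approach, assuming a host-known upper bound maxComponents on nComponents[0] (maxComponents, scratch, numBlocks and threadsPerBlock are hypothetical names, not from your code):

// Each thread indexes its own fixed-size slice of one pre-allocated buffer,
// so no in-kernel malloc/free is needed.
__global__ void plain(int* geneVec, float* probs, int* nComponents,
                      float* randomNumbers, int* nGenes,
                      float* scratch, int maxComponents)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);
    float* currentProbs = scratch + xid * maxComponents;
    // ... same loop body as before ...
}

// Host side: one cudaMalloc covers every thread's scratch space
float* scratch;
cudaMalloc(&scratch, sizeof(float) * numBlocks * threadsPerBlock * maxComponents);
plain<<<numBlocks, threadsPerBlock>>>(geneVec, probs, nComponents,
                                      randomNumbers, nGenes, scratch, maxComponents);

Note that this wastes a little memory if most threads need fewer than maxComponents floats, but it keeps all allocation out of the hot path and the per-thread slices are contiguous and easy to index.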
Upvotes: 12