Reputation: 35525
I have a kernel that requires a small amount of memory to operate. However, the amount is input-specific, so I am dynamically allocating it with malloc inside the kernel. It is usually a very small vector, rarely more than 5 in length and almost never more than 10. Within a single execution of the program the length never changes; it only needs to be dynamic to support different inputs.
(In short, it's a depth-first tree search where the searched object can be part of several leaves, so I need to keep track of the current search path and how many I have checked in each node on the path. Not relevant, just adding some info.)
I tested the speed difference between dynamically allocating it and statically giving it a size (int path[6];) and, unsurprisingly, the static version is around 15 times faster.
Is there an alternative to dynamically allocating the arrays for my particular case (I know it's the same length all the time, for a given execution of the program)? I thought of just statically allocating a vector of e.g. 10 values so I am always safe, but that puts an arbitrary upper limit on my input size, which, even if it's hard to reach, I would rather not do. Any ideas?
Upvotes: 0
Views: 83
Reputation: 72349
Given that the length is the same for a given execution of the program, the logical solution is to use templating with the size passed as a template parameter:
template <int lrrysz>
__global__ void thekernel(float *in, float *out)
{
    // Per-thread array whose size is fixed at compile time
    float local[lrrysz];
    // your code goes here
}

// Explicit instantiations, one per supported array size
template __global__ void thekernel<5>(float*, float*);
template __global__ void thekernel<6>(float*, float*);
template __global__ void thekernel<7>(float*, float*);
template __global__ void thekernel<8>(float*, float*);
template __global__ void thekernel<9>(float*, float*);
template __global__ void thekernel<10>(float*, float*);
Doing this brings with it other advantages, particularly that the array size is now a compile-time constant. This means that the compiler can deploy a lot of optimizations, like unrolling loops, and, if you are extremely lucky, keep the array in registers rather than in local memory. It also means the usual compiler controls, like launch bounds, register limits, etc., also work as intended and may offer performance improvements at each differing local storage size.
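Since the length is only known when the program runs, the host still has to pick the matching instantiation. A minimal sketch of one way to do that follows, assuming a switch over the supported sizes. The helper `launch_thekernel`, the macro `LAUNCH_CASE`, and the placeholder kernel body are hypothetical and only there for illustration; the real kernel body would go in their place.

```cuda
#include <cuda_runtime.h>

// Kernel templated on the array length, same signature as above.
// The body is a placeholder (an assumption for this sketch): it fills the
// per-thread array and writes the sum of its elements to out.
template <int lrrysz>
__global__ void thekernel(float *in, float *out)
{
    float local[lrrysz];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < lrrysz; ++i)
        local[i] = in[tid] + i;

    float sum = 0.0f;
    for (int i = 0; i < lrrysz; ++i)
        sum += local[i];

    out[tid] = sum;
}

// Hypothetical helper macro: one switch case per supported length.
#define LAUNCH_CASE(N)                                  \
    case N:                                             \
        thekernel<N><<<blocks, threads>>>(in, out);     \
        break;

// Hypothetical host-side dispatcher: maps the runtime length to the
// matching compile-time instantiation. Returns false if the length is
// not one of the compiled sizes or if the launch reports an error.
bool launch_thekernel(int len, float *in, float *out, int blocks, int threads)
{
    switch (len) {
        LAUNCH_CASE(5)
        LAUNCH_CASE(6)
        LAUNCH_CASE(7)
        LAUNCH_CASE(8)
        LAUNCH_CASE(9)
        LAUNCH_CASE(10)
    default:
        return false;   // unsupported length: nothing is launched
    }
    return cudaGetLastError() == cudaSuccess;
}
```

Each launch inside the switch instantiates the template implicitly, so explicit instantiations are only needed if the kernel is defined in a different translation unit from the dispatcher. The set of cases is still finite, but a length outside it now fails visibly on the host instead of overrunning a fixed-size array, and supporting a new length means adding one more case.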
Upvotes: 1