Reputation: 419
I am running a fitness function for 1024 matrices; each matrix gets its own block and all matrices are the same size. Each block has n*n threads (where n is the dimension of the matrix) and needs n*n shared memory so that I can do an easy sum reduction. However, the dimension n of the matrices is variable prior to runtime (i.e. it can be manually changed, though it is always a power of 2, so the summation is simple). The problem here is that shared memory must be allocated with a constant size, but I also need the value on the host to pass to the kernel. Where do I declare the dimension n so that it is visible to the CPU (for passing to the kernel) and can be used to declare the size of the shared memory (within the kernel)?
My code is structured like this. From main.cu I call the kernel:

const int num_states = 1024;
const int dimension = 4;

fitness <<< num_states, dimension * dimension >>> (device_array_of_states, dimension, num_states, device_fitness_return);
and then in kernel.cu I have:

__global__ void fitness(
    int *numbers,
    int dimension,
    int num_states,
    int *fitness_return) {
    __shared__ int fitness[16]; // <-- needs to be dimension * dimension
    // code
}
numbers is an array representing 1024 matrices, dimension is the row and column length, num_states is 1024, and fitness_return is an array of length 1024 that holds the fitness value for each matrix. In the kernel, the shared memory size is hard-coded with the square of dimension (so dimension is 4 in this example).
Where and how can I declare dimension so that it can be used both to allocate shared memory and to call the kernel, so that I only have to update dimension in one place? Thanks for your help.
Upvotes: 2
Views: 2372
Reputation: 132240
The amount of allocated shared memory is uniform over all blocks: you might be using a different amount of shared memory in each block, but the full allocation is still reserved for every block. Also, the amount of shared memory is rather limited regardless, so n*n elements cannot exceed the maximum amount of space (typically 48 KiB per block); for float-type elements (4 bytes each) that would mean n*n <= 12288, i.e. n <= 110 or so.
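If you don't want to hard-code that limit, you can query it from the runtime; a minimal sketch using the standard cudaDeviceGetAttribute call:

int shared_per_block = 0;
cudaDeviceGetAttribute(&shared_per_block,
                       cudaDevAttrMaxSharedMemoryPerBlock, 0 /* device */);
// dimension * dimension * sizeof(int) must not exceed shared_per_block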
Now, there are two ways to allocate shared memory: static and dynamic.

Static allocation is what you gave as an example, and it would not work:

__shared__ int fitness[16];

In these cases, the size must be known at compile time (at device-side code compile time) - which is not the case for you.
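For completeness, a minimal sketch of what static allocation would require - a compile-time constant visible to both host and device code (the header name constants.h here is hypothetical):

// constants.h (hypothetical header included by both main.cu and kernel.cu)
#define DIMENSION 4

// kernel.cu
#include "constants.h"

__global__ void fitness(/* ... */) {
    __shared__ int fitness[DIMENSION * DIMENSION]; // size fixed at compile time
    // ...
}

This only helps if you are willing to recompile whenever the dimension changes.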
With dynamic shared memory allocation, you don't specify the size in the kernel code - you leave the brackets empty and prepend extern:
extern __shared__ int fitness[];
Instead, you specify the amount (in bytes) as the third argument of the kernel launch configuration, and the threads of the different blocks don't necessarily know what it is.
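For your launch, that would look something like this (using the names from your main.cu; the third launch-configuration argument is the dynamic shared memory size in bytes):

fitness <<< num_states, dimension * dimension,
            dimension * dimension * sizeof(int) >>> (device_array_of_states, dimension, num_states, device_fitness_return);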
But in your case, the threads do need to know what n is. Well, just pass it as a kernel argument. So,
__global__ void fitness(
    int *numbers,
    int dimension,
    int num_states,
    int *fitness_return,
    unsigned short fitness_matrix_order /* that's your n */)
{
    extern __shared__ int fitness[];
    /* ... etc ... */
}
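With that in place, the power-of-2 sum reduction you mention is straightforward. A minimal sketch of the kernel body, assuming one block per matrix, blockDim.x == dimension * dimension, and (for illustration only) that the per-matrix fitness is just the sum of its elements:

extern __shared__ int fitness[];
int tid = threadIdx.x;
// each thread loads one element of this block's matrix
fitness[tid] = numbers[blockIdx.x * blockDim.x + tid];
__syncthreads();
// pairwise tree reduction; correct because blockDim.x is a power of 2
for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
        fitness[tid] += fitness[tid + stride];
    __syncthreads();
}
if (tid == 0)
    fitness_return[blockIdx.x] = fitness[0];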
NVIDIA's Parallel Forall blog has a nice post with a more in-depth introduction to using shared memory, which specifically covers static and dynamic shared memory allocation.
Upvotes: 6