Reputation: 166
With CUDA I am trying to get compile-time optimizations for an array of vectors (int2 in my case), but I am unable to achieve this in a clean manner. Let me be more concrete: I am working on a problem that uses two constant arrays, c and w. Array w consists of floats and array c consists of int2's. Since these arrays are constant, I want the compiler to perform compile-time optimizations, effectively optimizing away the array accesses. For example, in the following two device functions the compiler unrolls the loop and optimizes away the array accesses by replacing them directly with the values of c and w:
__forceinline__ __device__ float someFunction1() {
const int2 c[9] = {make_int2(0, 0), make_int2(1, 0), make_int2(0, 1), make_int2(-1, 0), make_int2(0, -1),
make_int2(1, 1), make_int2(-1, 1), make_int2(-1, -1), make_int2(1, -1)};
const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};
#pragma unroll
for (int i = 0; i < 9; ++i) {
//Do something here, accessing c[i] and w[i]
}
}
__forceinline__ __device__ float someFunction2() {
const int2 c[9] = {make_int2(0, 0), make_int2(1, 0), make_int2(0, 1), make_int2(-1, 0), make_int2(0, -1),
make_int2(1, 1), make_int2(-1, 1), make_int2(-1, -1), make_int2(1, -1)};
const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};
#pragma unroll
for (int i = 0; i < 9; ++i) {
//Do something here, accessing c[i] and w[i]
}
}
Now, the problem is that I don't want to continuously declare c and w in each device function that uses c and w. I can declare w globally, but I am not allowed to declare c globally, because CUDA won't allow me to call the make_int2 constructor in a global variable. That is, the program below gives the error "can't generate code for non empty constructors or destructors on device":
//Declaring array c like this is not allowed
__device__ const int2 c[9] = {make_int2(0, 0), make_int2(1, 0), make_int2(0, 1), make_int2(-1, 0), make_int2(0, -1),
make_int2(1, 1), make_int2(-1, 1), make_int2(-1, -1), make_int2(1, -1)};
__device__ const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};
__forceinline__ __device__ float someFunction() {
#pragma unroll
for (int i = 0; i < 9; ++i) {
//Do something here, accessing c[i] and w[i]
}
}
My question is: how can I avoid declaring c and w in each function that accesses them and still get the compile-time optimizations that I want? Or, stated otherwise: is there a workaround for declaring an array of vectors globally?
N.B.: I am aware that I can store c and w in global or __constant__ memory, but this won't give me the compile-time optimizations. __constant__ memory can also be slow when accessed irregularly, because reads to different addresses within a warp are serialized.
Upvotes: 0
Views: 269
Reputation: 151809
I don't know if this really enables what you're looking for in terms of compiler optimization, but declaring the data as a plain int array and casting its pointer from const int * to const int2 * seems to work for me:
#include <stdio.h>
// 9 (x, y) pairs stored flat, matching the 9-element int2 array c in the question
__device__ const int ci[18] = {0, 0, 1, 0, 0, 1, -1, 0, 0, -1, 1, 1, -1, 1, -1, -1, 1, -1};
__device__ const int2 *c = (const int2 *)ci;
__global__ void mykernel(){
int2 temp = c[1];
int2 temp1 = c[4];
printf("c[1].x = %d\n", temp.x);
printf("c[4].y = %d\n", temp1.y);
}
int main(){
mykernel<<<1,1>>>();
cudaDeviceSynchronize();
printf("Done\n");
return 0;
}
Note that const and __constant__ are not the same thing. You can eliminate the const declaration from the variable definitions for c and ci, but I assume having it there would help the compiler achieve what you desire.
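Another option that may be worth trying, as a minimal sketch rather than a confirmed solution: keep the tables function-local, as in the question, but declare them only once inside small inline accessor functions. Since int2 is a plain struct with x and y members, brace initialization can be used instead of make_int2. The names latticeC, latticeW and weightedSumX are hypothetical, and the #ifndef __CUDACC__ block is only an assumption-laden fallback so the sketch also builds with a plain C++ compiler.

```cpp
// Fallback so this sketch also builds as plain C++ (outside nvcc).
// Under nvcc these names are provided by the CUDA toolchain.
#ifndef __CUDACC__
#define __host__
#define __device__
#define __forceinline__ inline
struct int2 { int x, y; };
#endif

// Hypothetical accessor: the int2 table is declared exactly once, here.
__host__ __device__ __forceinline__ int2 latticeC(int i) {
    // int2 is a plain struct, so brace initialization needs no make_int2 call.
    const int2 c[9] = {{0, 0}, {1, 0}, {0, 1}, {-1, 0}, {0, -1},
                       {1, 1}, {-1, 1}, {-1, -1}, {1, -1}};
    return c[i];
}

// Hypothetical accessor for the weights, same idea.
__host__ __device__ __forceinline__ float latticeW(int i) {
    const float w[9] = {4.0f / 9.0f,  1.0f / 9.0f,  1.0f / 9.0f,
                        1.0f / 9.0f,  1.0f / 9.0f,  1.0f / 36.0f,
                        1.0f / 36.0f, 1.0f / 36.0f, 1.0f / 36.0f};
    return w[i];
}

// Example user: no table declarations needed in this function.
// The loop has a constant trip count, so it can be fully unrolled.
__host__ __device__ __forceinline__ float weightedSumX() {
    float sum = 0.0f;
#pragma unroll
    for (int i = 0; i < 9; ++i) {
        sum += latticeW(i) * latticeC(i).x;
    }
    return sum;
}
```

Whether the compiler actually folds the accesses into constants after inlining and unrolling is not guaranteed; it is worth checking the generated code, for example with nvcc -ptx or cuobjdump -sass.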
Upvotes: 1