Global array of vector type work-around

Question

With CUDA I am trying to get compile-time optimizations with an array of vector types (int2 in my case), but I am unable to achieve this in a clean manner. To be more concrete: I am working on a problem that uses two constant arrays, c and w. Array w consists of floats and array c consists of int2's. Since these arrays are constant, I want the compiler to perform compile-time optimizations, effectively optimizing away the array accesses. For example, for the following two device functions the compiler unrolls the loop and optimizes away the array accesses by replacing them directly with the values of c and w:

__forceinline__ __device__ float someFunction1() {
  const int2  c[9] = {make_int2(0, 0), make_int2(1, 0), make_int2(0, 1), make_int2(-1, 0), make_int2(0, -1),
                      make_int2(1, 1), make_int2(-1, 1), make_int2(-1, -1), make_int2(1, -1)};
  const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};
  #pragma unroll
  for (int i = 0; i < 9; ++i) {
    //Do something here, accessing c[i] and w[i]
  }
}

__forceinline__ __device__ float someFunction2() {
  const int2  c[9] = {make_int2(0, 0), make_int2(1, 0), make_int2(0, 1), make_int2(-1, 0), make_int2(0, -1),
                      make_int2(1, 1), make_int2(-1, 1), make_int2(-1, -1), make_int2(1, -1)};
  const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};
  #pragma unroll
  for (int i = 0; i < 9; ++i) {
    //Do something here, accessing c[i] and w[i]
  }
}

Now, the problem is that I don't want to declare c and w again in every device function that uses them. I can declare w globally, but I am not allowed to declare c globally, because CUDA won't allow me to call the make_int2 constructor in the initializer of a global variable. That is, the program below gives the error "can't generate code for non empty constructors or destructors on device":

//Declaring array c like this is not allowed
__device__ const int2  c[9] = {make_int2(0, 0), make_int2(1, 0), make_int2(0, 1), make_int2(-1, 0), make_int2(0, -1),
                               make_int2(1, 1), make_int2(-1, 1), make_int2(-1, -1), make_int2(1, -1)};
__device__ const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};

__forceinline__ __device__ float someFunction() {
  #pragma unroll
  for (int i = 0; i < 9; ++i) {
    //Do something here, accessing c[i] and w[i]
  }
}

My question is: how can I avoid declaring c and w in each function that accesses them and still get the compile-time optimizations that I want? Or, stated otherwise: is there a work-around for declaring an array of vectors globally?

N.B.: I am aware that I can store c and w in global or __constant__ memory, but this won't give me the compile-time optimizations. __constant__ memory may also become problematic when accessed irregularly.

Robert Crovella · Accepted Answer

I don't know if this really enables what you're looking for in terms of compiler optimization, but casting the pointer from int to int2 seems to work for me:

#include <stdio.h>

// Flat int storage: the first eight (x, y) pairs of c from the question
__device__ const int  ci[16] = {0, 0, 1, 0, 0, 1, -1, 0, 0, -1, 1, 1, -1, 1, -1, -1};

// View the int array as an array of int2
__device__ const int2 *c = (const int2 *)ci;

__global__ void mykernel(){

  int2 temp = c[1];
  int2 temp1 = c[4];
  printf("c[1].x = %d\n", temp.x);
  printf("c[4].y = %d\n", temp1.y);
}

int main(){

  mykernel<<<1,1>>>();
  cudaDeviceSynchronize();
  printf("Done\n");
  return 0;
}

Note that const and __constant__ are not the same thing. You can eliminate the const declaration from the variable definitions for c and ci, but I assume having it there would help the compiler achieve what you desire.
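
As a possible alternative to the pointer cast (not part of the answer above), int2 is a plain struct with members x and y, so it can be brace-initialized at global scope without calling make_int2 at all. The sketch below assumes a reasonably recent nvcc. The loop body, weightedSum, and runWeightedSum are hypothetical names made up for illustration, and the sketch does not show whether the compiler actually folds the array accesses into constants; the generated PTX or SASS would have to be inspected for that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Aggregate (brace) initialization: no make_int2 call in the global initializer.
__device__ const int2  c[9] = {{0, 0}, {1, 0}, {0, 1}, {-1, 0}, {0, -1},
                               {1, 1}, {-1, 1}, {-1, -1}, {1, -1}};
__device__ const float w[9] = {4.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f, 1.0f/9.0f,
                               1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f, 1.0f/36.0f};

// Hypothetical loop body (an assumption, the question leaves it open):
// sum of w[i] * |c[i]|^2 over all nine entries.
__forceinline__ __device__ float weightedSum() {
  float acc = 0.0f;
  #pragma unroll
  for (int i = 0; i < 9; ++i) {
    acc += w[i] * (float)(c[i].x * c[i].x + c[i].y * c[i].y);
  }
  return acc;
}

__global__ void mykernel(float *out) {
  *out = weightedSum();
}

// Hypothetical host helper: launches the kernel and copies the result back.
// Error checking is omitted for brevity; the result stays -1 if the copy fails.
float runWeightedSum() {
  float *d_out = nullptr;
  float h_out = -1.0f;
  cudaMalloc(&d_out, sizeof(float));
  mykernel<<<1, 1>>>(d_out);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return h_out;
}

int main() {
  printf("weighted sum = %f\n", runWeightedSum());
  return 0;
}
```

For these values the sum is mathematically 4/9 + 8/36 = 2/3, so the printed result should be close to that. Unlike the cast from an int array, this declaration keeps the natural alignment of int2.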
