Austin

Reputation: 1020

Best way to copy 2D or 3D data to Local Memory

I am starting to do a lot of work in 3D for my OpenCL kernels for filtering. Is there an optimum way to copy a 2D or 3D subset from global memory into local or private memory?

The use for this could be to take a 3D dataset and apply a 3D kernel (or operate on the space occupied by the 3D kernel). Each thread looks at one pixel, crops a kernel-sized (say 1, 3, 5, etc.) region of data around that pixel in all 3 dimensions, copies this subset to local or private memory, and then computes, for example, the Standard Deviation of the subset.

The easiest and least efficient way is just by brute force:

__kernel void Filter_3D_StdDev(__global float *Data_3D_In,
                               int KernelSize){
//Note: KernelSize is always ODD
//Note: OpenCL C has no variable-length arrays, so in practice KernelSize
//must be a compile-time constant (e.g. passed with -DKernelSize=3)

int i = get_global_id(0); //x
int j = get_global_id(1); //y
int k = get_global_id(2); //z

//Convert 3D to 1D
int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;

//private memory
float Subset[KernelSize*KernelSize*KernelSize];

int HalfKernel = (KernelSize - 1)/2; //compute the pixel radius

//Note: no boundary handling; assumes the pixel is at least
//HalfKernel away from every edge of the volume
for(int z = -HalfKernel ; z <= HalfKernel; z++){
     for(int y = -HalfKernel ; y <= HalfKernel; y++){
          for(int x = -HalfKernel ; x <= HalfKernel; x++){
               int index = (i + x) + get_global_size(0)*(j + y) +
                           get_global_size(0)*get_global_size(1)*(k + z);
               Subset[(x + HalfKernel) + (y + HalfKernel)*KernelSize + (z + HalfKernel)*KernelSize*KernelSize] = Data_3D_In[index];
          }
     }
}

//Filter subset here

}

This is horribly inefficient, since so many calls are made to global memory. Is there a way to improve this?

My first thought is to use vload to reduce the number of loops, such as:

__kernel void Filter_3D_StdDev(__global float *Data_3D_In,
                               int KernelSize){
//Note: KernelSize is always ODD

int i = get_global_id(0); //x
int j = get_global_id(1); //y
int k = get_global_id(2); //z

//Convert 3D to 1D
int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;

//private memory
float Subset[KernelSize*KernelSize*KernelSize];

int HalfKernel = (KernelSize - 1)/2; //compute the pixel radius

for(int z = -HalfKernel ; z <= HalfKernel; z++){
     for(int y = -HalfKernel ; y <= HalfKernel; y++){
          //##TODO##
          //Automatically determine which vload to use based on Kernel Size
          //for now, use vload3 (one row of 3 pixels per call)
          int index = (i - HalfKernel) + get_global_size(0)*(j + y) +
                      get_global_size(0)*get_global_size(1)*(k + z);
          int subset_index = (y + HalfKernel)*KernelSize + (z + HalfKernel)*KernelSize*KernelSize;
          //vloadn/vstoren offsets are in units of n elements, so pass
          //the raw element index through the pointer instead
          float3 temp = vload3(0, Data_3D_In + index);
          vstore3(temp, 0, Subset + subset_index);
     }
}

//Filter subset here

}

Is there an even better way?

Thanks in Advance!

Upvotes: 0

Views: 1653

Answers (1)

Jim V

Reputation: 263

First off, you need to unroll those loops. To do it properly you will have to make several copies of the function, do string replacement before you compile, or unroll the loops by hand; but just as a test, do:

#define HALF_KERNEL_SIZE 2
#pragma unroll 5 // i.e. HALF_KERNEL_SIZE * 2 + 1; many compilers want a literal here
for(int z = -HALF_KERNEL_SIZE ; z <= HALF_KERNEL_SIZE ; z++){
    #pragma unroll 5 // HALF_KERNEL_SIZE * 2 + 1
    for(int y = -HALF_KERNEL_SIZE ; y <= HALF_KERNEL_SIZE ; y++){

For the GPU you should read the data into local memory, especially for the 5x5x5 kernels: neighbouring work-items need largely the same pixels, so without local memory you are going back to global memory A LOT for data you have already fetched. For the CPU this is not as big of an issue.

So do this exactly as you would do for convolution but with an extra dimension:

1. Read in a block (or cube) of memory into local memory for a number of threads.
2. Create a barrier to make sure all data is read before you continue.
3. Sample into your local memory using your local id as an offset.
4. Test various local workgroup sizes until you get best performance

Everything else is the same. For the larger kernels with a bigger overlap this will be orders of magnitude faster.

Upvotes: 1
