ALFRAM

Reputation: 193

Optimizing CUDA interpolation

I have developed the following interpolation with CUDA and I am looking for a way to improve it. For certain reasons, I don't want to use CUDA textures.

The other point I have noticed, for some unknown reason, is that the interpolation is not performed on the whole vector when the size of the vector is greater than the total number of threads (for example, with a vector of size 1000 and 512 threads, each thread does its first job and that's all). I would like to optimize the singleInterp function.

Here is my code:

__device__ float singleInterp(float* data, float x, int lx_data)
{
    // data layout: data[0..lx_data-1] holds the sample positions (ascending),
    // data[lx_data..2*lx_data-1] holds the corresponding sample values.
    float res = 0.0f;
    int i1 = 0;
    int j = lx_data - 1;  // last valid position index; starting at lx_data
                          // reads past the positions into the values
    int imid;

    // Binary search for the interval [i1, j] that brackets x.
    while (j > i1 + 1)
    {
        imid = (i1 + j + 1) / 2;
        if (data[imid] < x)
            i1 = imid;
        else
            j = imid;
    }
    if (i1 == j)  // degenerate case: a single sample, no interval to interpolate
        res = data[i1 + lx_data];
    else
        // Linear interpolation: value(i1) + slope * (x - position(i1)),
        // using fast division and fused multiply-add intrinsics.
        res = __fmaf_rn(__fdividef(data[j + lx_data] - data[i1 + lx_data],
                                   data[j] - data[i1]),
                        x - data[i1], data[i1 + lx_data]);

    return res;
}
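For checking the device function against known values, a plain-C host reference of the same binary search plus linear interpolation can help. This is a sketch under the assumed data layout (positions first, then values); the function name is illustrative. Note that j starts at the last valid position index, lx_data - 1; the posted code starts at lx_data, which reads past the positions array.

```c
#include <stddef.h>

/* Host-side reference of singleInterp. `data` holds lx_data ascending
   sample positions followed by lx_data sample values (assumed layout). */
float single_interp_host(const float *data, float x, int lx_data)
{
    int i1 = 0;
    int j = lx_data - 1;            /* last valid position index */

    /* Binary search for the bracketing interval [i1, j]. */
    while (j > i1 + 1) {
        int imid = (i1 + j + 1) / 2;
        if (data[imid] < x)
            i1 = imid;
        else
            j = imid;
    }
    if (i1 == j)                    /* single-sample degenerate case */
        return data[i1 + lx_data];

    /* value(i1) + slope * (x - position(i1)) */
    float slope = (data[j + lx_data] - data[i1 + lx_data])
                / (data[j] - data[i1]);
    return slope * (x - data[i1]) + data[i1 + lx_data];
}
```

Running both on the same inputs and comparing (within a float tolerance, since the device version uses __fdividef and __fmaf_rn) is a quick way to confirm the kernel's arithmetic before optimizing it.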

Kernel:

__global__ void linearInterpolation(float* data, float* x_in, int lx_data)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < lx_data)
        x_in[index] = singleInterp(data, x_in[index], lx_data);
}
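The partial coverage described above (only the first 512 of 1000 elements processed) happens whenever the launch supplies fewer total threads than vector elements: each thread processes exactly one index and the rest are never touched. One common fix is to launch enough blocks, e.g. `(lx_data + threads - 1) / threads`; another is a grid-stride loop, sketched here against the same singleInterp device function (this is an assumed rewrite, not the original poster's code):

```cuda
__global__ void linearInterpolationStrided(float* data, float* x_in, int lx_data)
{
    // Each thread handles indices i, i + stride, i + 2*stride, ...,
    // so the whole vector is covered regardless of launch configuration.
    int stride = blockDim.x * gridDim.x;
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < lx_data; i += stride)
        x_in[i] = singleInterp(data, x_in[i], lx_data);
}
```

With a grid-stride loop the kernel also stays correct if the caller later changes the block or grid size for performance tuning.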

Upvotes: 0

Views: 863

Answers (1)

Vitality

Reputation: 21455

It seems that you are interested in 1D linear interpolation. I have already faced the problem of optimizing this kind of interpolation, and I ended up with the following code:

__global__ void linear_interpolation_kernel_function_GPU(double* __restrict__ result_d, const double* __restrict__ data_d, const double* __restrict__ x_out_d, const int M, const int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N)
    {
        double reg_x_out = x_out_d[j/2]+M/2;
        int k = floor(reg_x_out);
        double a = (reg_x_out)-floor(reg_x_out);
        double dk = data_d[2*k+(j&1)];
        double dkp1 = data_d[2*k+2+(j&1)];
        result_d[j] = a * dkp1 + (-dk * a + dk);
    } 
}

The data are assumed to be sampled at integer nodes between -M/2 and M/2. The code is "equivalent" to 1D texture interpolation, as explained at the following web-page. For the 1D linear texture interpolation, see Fig. 13 of the CUDA-Programming-Guide. For comparisons between different solutions, please see the following thread.

Upvotes: 1
