solvingPuzzles
solvingPuzzles

Reputation: 8889

New CUDA Texture Object -- getting wrong data in 2D case

In CUDA 5.0, NVIDIA added a "texture object" (cudaTextureObject_t) that makes textures a bit easier to work with. Previously, it was necessary to define textures as global variables.


I followed this NVIDIA example on using the cudaTextureObject_t. It works properly for the 1D case. I tried to extend the example to work on 2D pitched memory:

#define WIDTH 6
#define HEIGHT 2
int width = WIDTH; int height = HEIGHT;
float h_buffer[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
float* d_buffer;
size_t pitch;
cudaMallocPitch(&d_buffer, &pitch, sizeof(float)*width, height);
cudaMemcpy2D(d_buffer, pitch, &h_buffer, sizeof(float)*width, sizeof(float)*width, height, cudaMemcpyHostToDevice);
printf("pitch = %d \n", pitch);

//CUDA 5 texture objects: https://developer.nvidia.com/content/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr = d_buffer;
resDesc.res.pitch2D.pitchInBytes =  pitch;
resDesc.res.pitch2D.width = width;
resDesc.res.pitch2D.height = height;
resDesc.res.pitch2D.desc.f = cudaChannelFormatKindFloat;
resDesc.res.pitch2D.desc.x = 32; // bits per channel 
resDesc.res.pitch2D.desc.y = 32; 
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

To see if the data is indeed accessible through the texture cache, I printed a few bytes in this kernel:

__global__ void printGpu_tex(cudaTextureObject_t tex) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = tex2D<float>(tex, tidy, tidx);
        printf("tex2D<float>(tex, %d, %d) = %f \n", tidy, tidx, x);
    }
}

I expected the output of this to be "1,2,3,...,12." But, it prints "1,7,7,7,...3,9,...":

tex2D<float>(tex, 0, 0) = 1.000000 
tex2D<float>(tex, 0, 1) = 7.000000 
tex2D<float>(tex, 0, 2) = 7.000000 
tex2D<float>(tex, 0, 3) = 7.000000 
tex2D<float>(tex, 0, 4) = 7.000000 
tex2D<float>(tex, 0, 5) = 7.000000 
tex2D<float>(tex, 1, 0) = 3.000000 
tex2D<float>(tex, 1, 1) = 9.000000 
tex2D<float>(tex, 1, 2) = 9.000000 
tex2D<float>(tex, 1, 3) = 9.000000 
tex2D<float>(tex, 1, 4) = 9.000000 
tex2D<float>(tex, 1, 5) = 9.000000 

To verify that the d_buffer data is set up correctly, I also made a "print kernel" for the raw d_buffer array without using the texture cache:

__global__ void printGpu_vanilla(float* d_buffer, int pitch) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = d_buffer[tidy*pitch + tidx];
        printf("d_buffer[%d][%d] = %f \n", tidy, tidx, x);
    }
}

output looks good (unlike the texture cache version):

d_buffer[0][0] = 1.000000 
d_buffer[0][2] = 2.000000 
d_buffer[0][3] = 3.000000 
d_buffer[0][4] = 4.000000 
d_buffer[0][5] = 5.000000 
d_buffer[0][5] = 6.000000 
d_buffer[1][0] = 7.000000 
d_buffer[1][6] = 8.000000 
d_buffer[1][7] = 9.000000 
d_buffer[1][8] = 10.000000 
d_buffer[1][9] = 11.000000 
d_buffer[1][5] = 12.000000 

Any ideas on what might be going wrong with the texture cache version?


Downloads:

Upvotes: 4

Views: 5643

Answers (2)

Mohsen
Mohsen

Reputation: 183

Except cudaChannelFormatDesc, seems you have one logical problem in your code which is not a big deal, but can be very misleading if you are not cautious. If you want to follow the CUDA thread organization into blocks and grids and the way wraps are scheduled (moreover, if you want your code be consistence with C++ concept of being "row major"), better to consider x as the fastest varying dimension (similar to row major). Since your code shows that y is varying faster that x, more proper way would be switching the indexes in your code:

float x = tex2D<float>(tex, tidx, tidy);
printf("tex2D<float>(tex, %d, %d) = %f \n", tidx, tidy, x);
...
printf("d_buffer[%d][%d] = %f \n", tidx, tidy, x);

Worth to mention once again, it's not a big problem, but meanwhile can be very confusing, specially when you want to integrate this kernel with other parts of your code.

Upvotes: 0

kunzmi
kunzmi

Reputation: 1024

Your cudaChannelFormatDesc in resDesc.res.pitch2D.desc is wrong: y should be 0.

To set the FormatDesc right use CreateChannelDesc<>() functions like resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>(); instead of setting it manually.

resDesc.res.pitch2D.desc.y = 32 would be valid for a float2 texture.

Upvotes: 4

Related Questions