CUDA tex1Dfetch and pitched memory

Question

My question is "Can the tex1Dfetch function be used with pitched linear memory"?

Section B.8.1.1.tex1dfetch() says "fetches from the region of linear memory....".

I am optimizing a kernel which is bound by global memory reads. It uses a grid-stride loop to load a float4:

float4 x = XYZW[i]; // float4 const * const XYZW
float4 x = tex1Dfetch



The first example, using pointer argument, works fine.  The tex1Dfetch form returns all zeros.  The tex1D call returns aliased/psychadelic garbage.

In all cases, cudaMallocPitched is used.  The texture loads profiled faster, with tex1Dfetch the fastest.  So I'm keen to get it working correctly.

Robert Crovella · Accepted Answer

My question is "Can the tex1Dfetch function be used with pitched linear memory"?

In C/C++, a pointer is simply a bare number. Once you allocate a region, the returned pointer itself isn't conceptually any different whether it was returned by cudaMalloc or cudaMallocPitch; it is the address of a location in device memory.

tex1Dfetch can be made to work with either type of allocation (cudaMalloc or cudaMallocPitch). The actual values you get will depend on how you handle the pitch regions in the cudaMallocPitch case, and tex1Dfetch won't do anything "automatic" there for you. So it seems not very sensible (to me) to use tex1Dfetch with a pitched allocation, but it can be made to "work", ie. fetch the correct value.

Here is a simple demonstrator of this claim, based on slight modification to the code here:

$ cat t434.cu
#include 
#define N 32
#define M 128

// texture object is a kernel argument
__global__ void kernel(cudaTextureObject_t tex) {
  int i = blockIdx.x *blockDim.x + threadIdx.x;
  float x = tex1Dfetch(tex, i);
  if (i < 256) printf("%d %f
", i, x);
}

void call_kernel(cudaTextureObject_t tex) {
  dim3 block(128,1,1);
  dim3 grid((N*M)/block.x,1,1);
  kernel <<>>(tex);
}

int main() {
  // declare and allocate memory
  float *buffer, *h_buffer;
  size_t pitch;
  cudaMallocPitch(&buffer, &pitch, N*sizeof(float),M);
  printf("pitch = %lu
", pitch);
  cudaMemset(buffer, 0, M*pitch);
  h_buffer=(float *)malloc(N*M*sizeof(float));
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++) h_buffer[i*N+j] = i+1;
  cudaMemcpy2D(buffer, pitch, h_buffer, N*sizeof(float), N*sizeof(float), M,  cudaMemcpyHostToDevice);
  // create texture object
  cudaResourceDesc resDesc;
  memset(&resDesc, 0, sizeof(resDesc));
  resDesc.resType = cudaResourceTypeLinear;
  resDesc.res.linear.devPtr = buffer;
  resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
  resDesc.res.linear.desc.x = 32; // bits per channel
  resDesc.res.linear.sizeInBytes = M*pitch*sizeof(float);

  cudaTextureDesc texDesc;
  memset(&texDesc, 0, sizeof(texDesc));
  texDesc.readMode = cudaReadModeElementType;

  // create texture object: we only have to do this once!
  cudaTextureObject_t tex=0;
  cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

  call_kernel(tex); // pass texture as argument

  // destroy texture object
  cudaDestroyTextureObject(tex);

  cudaFree(buffer);
}
$ nvcc -arch=sm_61 -o t434 t434.cu
$ cuda-memcheck ./t434
========= CUDA-MEMCHECK
pitch = 512
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000
18 1.000000
19 1.000000
20 1.000000
21 1.000000
22 1.000000
23 1.000000
24 1.000000
25 1.000000
26 1.000000
27 1.000000
28 1.000000
29 1.000000
30 1.000000
31 1.000000
32 0.000000
33 0.000000
34 0.000000
35 0.000000
36 0.000000
37 0.000000
38 0.000000
39 0.000000
40 0.000000
41 0.000000
42 0.000000
43 0.000000
44 0.000000
45 0.000000
46 0.000000
47 0.000000
48 0.000000
49 0.000000
50 0.000000
51 0.000000
52 0.000000
53 0.000000
54 0.000000
55 0.000000
56 0.000000
57 0.000000
58 0.000000
59 0.000000
60 0.000000
61 0.000000
62 0.000000
63 0.000000
96 0.000000
97 0.000000
98 0.000000
99 0.000000
100 0.000000
101 0.000000
102 0.000000
103 0.000000
104 0.000000
105 0.000000
106 0.000000
107 0.000000
108 0.000000
109 0.000000
110 0.000000
111 0.000000
112 0.000000
113 0.000000
114 0.000000
115 0.000000
116 0.000000
117 0.000000
118 0.000000
119 0.000000
120 0.000000
121 0.000000
122 0.000000
123 0.000000
124 0.000000
125 0.000000
126 0.000000
127 0.000000
64 0.000000
65 0.000000
66 0.000000
67 0.000000
68 0.000000
69 0.000000
70 0.000000
71 0.000000
72 0.000000
73 0.000000
74 0.000000
75 0.000000
76 0.000000
77 0.000000
78 0.000000
79 0.000000
80 0.000000
81 0.000000
82 0.000000
83 0.000000
84 0.000000
85 0.000000
86 0.000000
87 0.000000
88 0.000000
89 0.000000
90 0.000000
91 0.000000
92 0.000000
93 0.000000
94 0.000000
95 0.000000
128 2.000000
129 2.000000
130 2.000000
131 2.000000
132 2.000000
133 2.000000
134 2.000000
135 2.000000
136 2.000000
137 2.000000
138 2.000000
139 2.000000
140 2.000000
141 2.000000
142 2.000000
143 2.000000
144 2.000000
145 2.000000
146 2.000000
147 2.000000
148 2.000000
149 2.000000
150 2.000000
151 2.000000
152 2.000000
153 2.000000
154 2.000000
155 2.000000
156 2.000000
157 2.000000
158 2.000000
159 2.000000
192 0.000000
193 0.000000
194 0.000000
195 0.000000
196 0.000000
197 0.000000
198 0.000000
199 0.000000
200 0.000000
201 0.000000
202 0.000000
203 0.000000
204 0.000000
205 0.000000
206 0.000000
207 0.000000
208 0.000000
209 0.000000
210 0.000000
211 0.000000
212 0.000000
213 0.000000
214 0.000000
215 0.000000
216 0.000000
217 0.000000
218 0.000000
219 0.000000
220 0.000000
221 0.000000
222 0.000000
223 0.000000
160 0.000000
161 0.000000
162 0.000000
163 0.000000
164 0.000000
165 0.000000
166 0.000000
167 0.000000
168 0.000000
169 0.000000
170 0.000000
171 0.000000
172 0.000000
173 0.000000
174 0.000000
175 0.000000
176 0.000000
177 0.000000
178 0.000000
179 0.000000
180 0.000000
181 0.000000
182 0.000000
183 0.000000
184 0.000000
185 0.000000
186 0.000000
187 0.000000
188 0.000000
189 0.000000
190 0.000000
191 0.000000
224 0.000000
225 0.000000
226 0.000000
227 0.000000
228 0.000000
229 0.000000
230 0.000000
231 0.000000
232 0.000000
233 0.000000
234 0.000000
235 0.000000
236 0.000000
237 0.000000
238 0.000000
239 0.000000
240 0.000000
241 0.000000
242 0.000000
243 0.000000
244 0.000000
245 0.000000
246 0.000000
247 0.000000
248 0.000000
249 0.000000
250 0.000000
251 0.000000
252 0.000000
253 0.000000
254 0.000000
255 0.000000
========= ERROR SUMMARY: 0 errors

We see that the allocation pitch is 512 bytes. This corresponds to a total pitched width of 128 float quantities per row. However the 2D allocation only specifies 32 elements per row (N). We initialize a 2D region with 32 elements per row, where each value is the row index plus 1. The remaining "pitch" regions are initialized to zero. We observe in the output that the validly defined row 0 elements properly print out as 1, the validly defined row 1 elements properly print out as 2, and all other elements print out as zero, because we are fetching from the "pitch" region.

CUDA tex1Dfetch and pitched memory

Answers (1)

Related Questions