CUDA: Problems Using tex2D()

Question

I am having a lot of trouble trying to retrieve the matrix element A(m,n) from a 2D CUDA array. Interestingly, I obtain the correct element when m = n; i.e. the element is along the diagonal. Otherwise, I get some unexpected behavior: if, for example, I want to fetch element A(13,12), and I attempt to retrieve it using tex2D(tex, row + 0.5f, col + 0.5f), I get A(14,11). As far as I know, I am doing everything as I should, so I am really interested in knowing where I went wrong.

The kernel follows below. The mistake occurs right after the first two tex2D calls, so the rest is not really relevant.

texture tex_a;
texture tex_b;

// Assume that BinaryFunc is multiplication, and AccumulationFunc is addition.
// Then this kernel computes the standard matrix product, and uses prefetching
// with tile sizes given by the template parameter TileSize. 
template 
    __global__ void
matrix_prod_tex_prefetch(T* c, const SizeType dim, BinaryFunc binary_func,
        AccumulationFunc accum_func)
{
    __shared__ T as[TileSize][TileSize];
    __shared__ T bs[TileSize][TileSize];
    SizeType row = blockIdx.y * TileSize + threadIdx.y;
    SizeType col = blockIdx.x * TileSize + threadIdx.x;
    T p = 0;

    T l = tex2D(tex_a, row + 0.5f, threadIdx.x + 0.5f);
    T m = tex2D(tex_b, threadIdx.y + 0.5f, col + 0.5f);
    __syncthreads();

    for (SizeType i = 1; i != dim / TileSize; ++i) {
        as[threadIdx.y][threadIdx.x] = l;
        bs[threadIdx.y][threadIdx.x] = m;
        __syncthreads();
        l = tex2D(tex_a, row + 0.5f, i * TileSize + threadIdx.x + 0.5f);
        m = tex2D(tex_b, i * TileSize + threadIdx.y + 0.5f, col + 0.5f);
        for (SizeType k = 0; k != TileSize; ++k) {
            p = accum_func(p, binary_func(
                        as[threadIdx.y][k],
                        bs[k][threadIdx.x]
                        ));
        }
        __syncthreads();
    }

    as[threadIdx.y][threadIdx.x] = l;
    bs[threadIdx.y][threadIdx.x] = m;
    __syncthreads();
    for (SizeType k = 0; k != TileSize; ++k) {
        p = accum_func(p, binary_func(
                    as[threadIdx.y][k],
                    bs[k][threadIdx.x]
                    ));
    }
    c[dim * row + col] = p;
}

void-pointer · Accepted Answer

Answer: swap threadIdx.x with threadIdx.y. Ultimately, it comes down to a matter of semantics: textures use the indices as offsets along the x- and y-axis we are all familiar with. Matrices use the indices to refer to the index of the row and column. Essentially, the basis vectors are swapped.

Be warned that while swapping the use of threadIdx.x and threadIdx.y for 1D memory layouts may yield equivalent results, you may lose coalesced memory access patterns in the process.

CUDA: Problems Using tex2D()

Answers (1)

Related Questions