Reputation: 4903
I'm developing a CUDA app in which a kernel has to access global memory many times. This memory is accessed randomly by all CTAs (there is no locality, so shared memory cannot be used). I need to optimize it. I heard that texture memory can alleviate this problem, but can a kernel read and write texture memory? 1D texture memory? 2D texture memory? And what about CUDA arrays?
Upvotes: 4
Views: 9139
Reputation: 21465
This is a follow-up to Farzad's answer.
Farzad's point is highlighted in the CUDA C Programming Guide:
The texture and surface memory is cached (see Device Memory Accesses) and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
This means that one can modify the global memory locations a texture is bound to, but this must not happen in the same kernel in which the texture fetches are performed. On the other hand, "writing to a texture" in the above sense is possible across kernels, since the texture cache is cleared upon each kernel launch; see cuda kernel for add(a,b,c) using texture objects for a & b - works correctly for 'increment operation' add(a,b,a)?.
Below, I'm providing an example in which the global memory locations the texture is bound to are modified. In this example, I call CUDA kernels in the following way
median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
...
square<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, pitch, N);
...
median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
Texture fetches are performed in the median_filter_periodic_boundary kernel, while the square kernel modifies the global memory locations the texture is bound to.
Here is the code:
#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

texture<float, 1, cudaReadModeElementType> signal_texture;

#define BLOCKSIZE 32

/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_out, const unsigned int N){

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        float signal_center = tex1D(signal_texture, (float)(tid + 0.5 - 0) / (float)N);
        float signal_before = tex1D(signal_texture, (float)(tid + 0.5 - 1) / (float)N);
        float signal_after  = tex1D(signal_texture, (float)(tid + 0.5 + 1) / (float)N);

        d_out[tid] = (signal_center + signal_before + signal_after) / 3.f;
    }
}

/***********************************************************************/
/* KERNEL FUNCTION MODIFYING THE GLOBAL MEMORY THE TEXTURE IS BOUND TO */
/***********************************************************************/
__global__ void square(float * __restrict__ d_vec, const size_t pitch, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) d_vec[tid] = 2.f * tid;
}

/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input/output host array declaration and initialization
    float *h_vec = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_vec[i] = (float)i;

    // --- Input/output host and device array vectors
    size_t pitch;
    float *d_vec; gpuErrchk(cudaMallocPitch(&d_vec, &pitch, N * sizeof(float), 1));
    printf("pitch = %zu\n", pitch);
    float *d_out; gpuErrchk(cudaMalloc(&d_out, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_vec, h_vec, N * sizeof(float), cudaMemcpyHostToDevice));

    // --- CUDA texture memory binding and properties definition
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    size_t texture_offset = 0;
    gpuErrchk(cudaBindTexture2D(&texture_offset, signal_texture, d_vec, channelDesc, N, 1, pitch));
    signal_texture.normalized = true;
    signal_texture.addressMode[0] = cudaAddressModeWrap;

    // --- Median filter kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
    printf("\n\nFirst filtering\n");
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    // --- Square kernel execution
    square<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, pitch, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));
    printf("\n\nSquaring\n");
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    // --- Median filter kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    printf("\n\nSecond filtering\n");
    gpuErrchk(cudaMemcpy(h_vec, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}
Please, note the following:

1. The texture is not bound to a cudaArray, since cudaArrays cannot be modified from within kernels;
2. The texture is not bound to a cudaMalloc'ed array, since textures bound to cudaMalloc'ed arrays can be fetched only by tex1Dfetch, and tex1Dfetch does not support the cudaAddressModeWrap addressing mode guaranteeing the periodic extension of the signal outside its boundaries;
3. The texture is bound to a cudaMallocPitch'ed array, since this makes it possible to fetch the texture by tex1D, which allows the cudaAddressModeWrap addressing mode;
4. Normalized coordinates are used, since only these allow the cudaAddressModeWrap addressing mode.

I needed points #2, #3 and #4, since I extracted this example from a code I was writing.
Upvotes: 3
Reputation: 21465
This is a follow-up to sgarizvi's answer.
Nowadays, cards with compute capability >=2.0 are much more common than in 2012, when this question was asked.
Below is a minimal example of how to use CUDA surface memory to write to a texture.
#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

surface<void, cudaSurfaceType1D> surfD;

/*******************/
/* KERNEL FUNCTION */
/*******************/
__global__ void SurfaceMemoryWrite(const int N) {

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    surf1Dwrite((float)tid, surfD, tid * sizeof(float), cudaBoundaryModeTrap);
}

/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1, cudaArraySurfaceLoadStore));
    gpuErrchk(cudaBindSurfaceToArray(surfD, d_arr));

    SurfaceMemoryWrite<<<1, N>>>(N);

    float *h_arr = new float[N];
    gpuErrchk(cudaMemcpyFromArray(h_arr, d_arr, 0, 0, N * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) printf("h_arr[%i] = %f\n", i, h_arr[i]);

    return 0;
}
Upvotes: 5
Reputation: 3438
I came across this question, and with a bit of searching I found this question and this answer to it useful.
Basically, texture memory is global memory: "texture memory" refers to a special caching mechanism that can be associated with global memory reads. So a kernel can manipulate the global memory bound to the texture, but, as the linked answer shows, there is no instruction such as tex1D(ref, x) = 12.0.
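To sketch this point: the data lives in ordinary global memory, and the texture only changes the read path; writes go through a plain pointer, here to a separate output buffer so the in-kernel coherence issue does not arise. This uses the legacy texture reference API (current when this answer was written, removed in recent CUDA toolkits); names and the doubling operation are illustrative, not from the original answers.

```cuda
#include <stdio.h>

// Legacy texture reference bound to linear global memory
texture<float, 1, cudaReadModeElementType> tex_ref;

// Reads go through the texture cache; writes go through an ordinary
// pointer to a *different* buffer, so no read-after-write hazard occurs.
__global__ void scale(float * __restrict__ d_out, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) d_out[tid] = 2.f * tex1Dfetch(tex_ref, tid);
}

int main() {
    const int N = 8;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // --- Bind the texture reference to the plain cudaMalloc'ed buffer
    cudaBindTexture(NULL, tex_ref, d_in, N * sizeof(float));

    scale<<<1, N>>>(d_out, N);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("h_out[%i] = %f\n", i, h_out[i]);

    cudaUnbindTexture(tex_ref);
    return 0;
}
```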
Upvotes: 0
Reputation: 649
I would recommend declaring your memory as pitched linear memory and binding it to a texture. I have not experimented with the new bindless textures yet. Has anyone tried them?
Texture memory, as mentioned, is read-only through the cache, so treat it as read-only memory. It is therefore important that, within the kernel itself, you do not write to the memory bound to the texture, since the writes may not be reflected in the texture cache.
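For reference, a minimal sketch of what the "bindless" (texture object) API looks like, assuming compute capability >= 3.0; the kernel and buffer names are illustrative and this is untested here, not a definitive implementation:

```cuda
#include <stdio.h>

// The texture object is passed as a plain kernel argument: no global
// texture reference, hence "bindless".
__global__ void copy_through_tex(cudaTextureObject_t tex, float * __restrict__ d_out, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) d_out[tid] = tex1Dfetch<float>(tex, tid);
}

int main() {
    const int N = 8;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // --- Describe the resource (pitched/linear memory) ...
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_in;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = N * sizeof(float);

    // --- ... and the texture parameters
    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    copy_through_tex<<<1, N>>>(tex, d_out, N);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("h_out[%i] = %f\n", i, h_out[i]);

    cudaDestroyTextureObject(tex);
    return 0;
}
```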
Upvotes: 1
Reputation: 16796
CUDA textures are read-only. Texture reads are cached, so the performance gain is not guaranteed.
From CUDA Toolkit 3.1 onwards, writable textures known as surfaces are also available, but only for devices with compute capability >=2.0. Surfaces are just like textures, with the advantage that they can also be written by the kernel.
Surfaces can only be bound to cudaArrays created with the flag cudaArraySurfaceLoadStore.
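For completeness, the same idea can be sketched with the later surface object API (CUDA 5.0+, compute capability >= 3.0), which passes the surface as a kernel argument instead of using a global surface reference; this is an illustrative sketch, untested here:

```cuda
#include <stdio.h>

// Each thread writes its index into the surface; the x coordinate of
// surf1Dwrite is given in bytes.
__global__ void SurfaceWrite(cudaSurfaceObject_t surf, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) surf1Dwrite((float)tid, surf, tid * sizeof(float));
}

int main() {
    const int N = 10;

    // --- The backing cudaArray still needs cudaArraySurfaceLoadStore
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaArray *d_arr;
    cudaMallocArray(&d_arr, &channelDesc, N, 1, cudaArraySurfaceLoadStore);

    // --- Surface object tied to the cudaArray
    cudaResourceDesc resDesc = {};
    resDesc.resType         = cudaResourceTypeArray;
    resDesc.res.array.array = d_arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &resDesc);

    SurfaceWrite<<<1, N>>>(surf, N);

    float h_arr[N];
    cudaMemcpy2DFromArray(h_arr, N * sizeof(float), d_arr, 0, 0,
                          N * sizeof(float), 1, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("h_arr[%i] = %f\n", i, h_arr[i]);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(d_arr);
    return 0;
}
```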
Upvotes: 11