Reputation: 4903
I'm developing a CUDA app in which a kernel has to access global memory many times. This memory is accessed randomly by all CTAs (there is no locality, so shared memory cannot be used). I need to optimize it. I heard that texture memory can alleviate this problem, but can a kernel read and write texture memory? 1D texture memory? 2D texture memory? And what about CUDA arrays?
Upvotes: 4
Views: 9139
Reputation: 21465
This is a follow-up to Farzad's answer.
Farzad's point is highlighted in the CUDA C Programming Guide:
The texture and surface memory is cached (see Device Memory Accesses) and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
This means that one can modify the global memory locations a texture is bound to, but this must not happen in the same kernel in which the texture fetches are performed. On the other hand, "writing to a texture" in the above sense is possible across kernels, since the texture cache is cleared upon each kernel launch; see cuda kernel for add(a,b,c) using texture objects for a & b - works correctly for 'increment operation' add(a,b,a)?.
Below, I'm providing an example in which the global memory locations the texture is bound to are modified. In this example, I call CUDA kernels in the following way
median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
...
square<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, pitch, N);
...
median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
Texture fetches are performed in the median_filter_periodic_boundary kernel, while the square kernel modifies the global memory locations the texture is bound to.
Here is the code:
#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

texture<float, 1, cudaReadModeElementType> signal_texture;

#define BLOCKSIZE 32

/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_out, const unsigned int N){

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        float signal_center = tex1D(signal_texture, (float)(tid + 0.5 - 0) / (float)N);
        float signal_before = tex1D(signal_texture, (float)(tid + 0.5 - 1) / (float)N);
        float signal_after  = tex1D(signal_texture, (float)(tid + 0.5 + 1) / (float)N);

        d_out[tid] = (signal_center + signal_before + signal_after) / 3.f;
    }
}

/***********************************************************************/
/* KERNEL FUNCTION MODIFYING THE GLOBAL MEMORY THE TEXTURE IS BOUND TO */
/***********************************************************************/
__global__ void square(float * __restrict__ d_vec, const size_t pitch, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) d_vec[tid] = 2.f * tid;
}

/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input/output host array declaration and initialization
    float *h_vec = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_vec[i] = (float)i;

    // --- Input/output host and device array vectors
    size_t pitch;
    float *d_vec; gpuErrchk(cudaMallocPitch(&d_vec, &pitch, N * sizeof(float), 1));
    printf("pitch = %zu\n", pitch);
    float *d_out; gpuErrchk(cudaMalloc(&d_out, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_vec, h_vec, N * sizeof(float), cudaMemcpyHostToDevice));

    // --- CUDA texture memory binding and properties definition
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    size_t texture_offset = 0;
    gpuErrchk(cudaBindTexture2D(&texture_offset, signal_texture, d_vec, channelDesc, N, 1, pitch));
    signal_texture.normalized = true;
    signal_texture.addressMode[0] = cudaAddressModeWrap;

    // --- Median filter kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
    printf("\n\nFirst filtering\n");
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    // --- Square kernel execution
    square<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, pitch, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));
    printf("\n\nSquaring\n");
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    // --- Median filter kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_out, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    printf("\n\nSecond filtering\n");
    gpuErrchk(cudaMemcpy(h_vec, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}
Please, note the following:

1. The texture is not bound to a cudaArray, since cudaArrays cannot be modified from within kernels;
2. The texture is not bound to a cudaMalloc'ed array, since textures bound to cudaMalloc'ed arrays can be fetched only by tex1Dfetch, and tex1Dfetch does not support the cudaAddressModeWrap addressing mode guaranteeing the periodic extension of the signal outside its boundaries;
3. The texture is bound to a cudaMallocPitch'ed array, since this makes it possible to fetch the texture by tex1D, which allows the cudaAddressModeWrap addressing mode;
4. Normalized coordinates are used, since only these allow the cudaAddressModeWrap addressing mode.

I needed points #2, #3 and #4, since I extracted this example from a code I was writing.
Upvotes: 3
Reputation: 21465
This is a follow-up to sgarizvi's answer.
Nowadays, cards with compute capability >=2.0 are much more common than in 2012, when this question was asked.
Below is a minimal example of how to use CUDA surface memory to write to a texture.
#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

surface<void, cudaSurfaceType1D> surfD;

/*******************/
/* KERNEL FUNCTION */
/*******************/
__global__ void SurfaceMemoryWrite(const int N) {

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    surf1Dwrite((float)tid, surfD, tid * sizeof(float), cudaBoundaryModeTrap);
}

/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr; gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1, cudaArraySurfaceLoadStore));
    gpuErrchk(cudaBindSurfaceToArray(surfD, d_arr));

    SurfaceMemoryWrite<<<1, N>>>(N);

    float *h_arr = new float[N];
    gpuErrchk(cudaMemcpyFromArray(h_arr, d_arr, 0, 0, N * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) printf("h_arr[%i] = %f\n", i, h_arr[i]);

    return 0;
}
Upvotes: 5
Reputation: 3438
I came across this question, and with a bit of searching I found this question and this answer to it useful.
Basically, texture memory is global memory: "texture memory" refers to a special caching mechanism that can be associated with global memory reads. So a kernel can manipulate the global memory bound to the texture, but, as the linked answer shows, there is no instruction such as tex1D(ref, x) = 12.0.
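To sketch this point: the data lives in ordinary global memory, and the texture only changes the read path; writes go through a plain pointer, here to a separate output buffer so the in-kernel coherence issue does not arise. This uses the legacy texture reference API (current when this answer was written, removed in recent CUDA toolkits); names and the doubling operation are illustrative, not from the original answers.

```cuda
#include <stdio.h>

// Legacy texture reference bound to linear global memory
texture<float, 1, cudaReadModeElementType> tex_ref;

// Reads go through the texture cache; writes go through an ordinary
// pointer to a *different* buffer, so no read-after-write hazard occurs.
__global__ void scale(float * __restrict__ d_out, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) d_out[tid] = 2.f * tex1Dfetch(tex_ref, tid);
}

int main() {
    const int N = 8;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // --- Bind the texture reference to the plain cudaMalloc'ed buffer
    cudaBindTexture(NULL, tex_ref, d_in, N * sizeof(float));

    scale<<<1, N>>>(d_out, N);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("h_out[%i] = %f\n", i, h_out[i]);

    cudaUnbindTexture(tex_ref);
    return 0;
}
```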
Upvotes: 0
Reputation: 649
I would recommend declaring your memory as pitched linear memory and binding it to a texture. I have not experimented with the new bindless textures yet. Has anyone tried them?
Texture memory, as mentioned, is read-only through the cache, so treat it as read-only memory. It is therefore important that, within the kernel itself, you do not write to the memory bound to the texture, since the writes may not be reflected in the texture cache.
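For reference, a minimal sketch of what the "bindless" (texture object) API looks like, assuming compute capability >= 3.0; the kernel and buffer names are illustrative and this is untested here, not a definitive implementation:

```cuda
#include <stdio.h>

// The texture object is passed as a plain kernel argument: no global
// texture reference, hence "bindless".
__global__ void copy_through_tex(cudaTextureObject_t tex, float * __restrict__ d_out, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) d_out[tid] = tex1Dfetch<float>(tex, tid);
}

int main() {
    const int N = 8;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // --- Describe the resource (pitched/linear memory) ...
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_in;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = N * sizeof(float);

    // --- ... and the texture parameters
    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    copy_through_tex<<<1, N>>>(tex, d_out, N);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("h_out[%i] = %f\n", i, h_out[i]);

    cudaDestroyTextureObject(tex);
    return 0;
}
```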
Upvotes: 1
Reputation: 16796
CUDA textures are read-only. Texture reads are cached, so the performance gain is not guaranteed.
From CUDA Toolkit 3.1 onwards, writable textures known as surfaces are also available, but only for devices with compute capability >=2.0. Surfaces are just like textures, with the advantage that they can also be written by the kernel.
Surfaces can only be bound to cudaArrays created with the flag cudaArraySurfaceLoadStore.
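For completeness, the same idea can be sketched with the later surface object API (CUDA 5.0+, compute capability >= 3.0), which passes the surface as a kernel argument instead of using a global surface reference; this is an illustrative sketch, untested here:

```cuda
#include <stdio.h>

// Each thread writes its index into the surface; the x coordinate of
// surf1Dwrite is given in bytes.
__global__ void SurfaceWrite(cudaSurfaceObject_t surf, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) surf1Dwrite((float)tid, surf, tid * sizeof(float));
}

int main() {
    const int N = 10;

    // --- The backing cudaArray still needs cudaArraySurfaceLoadStore
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaArray *d_arr;
    cudaMallocArray(&d_arr, &channelDesc, N, 1, cudaArraySurfaceLoadStore);

    // --- Surface object tied to the cudaArray
    cudaResourceDesc resDesc = {};
    resDesc.resType         = cudaResourceTypeArray;
    resDesc.res.array.array = d_arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &resDesc);

    SurfaceWrite<<<1, N>>>(surf, N);

    float h_arr[N];
    cudaMemcpy2DFromArray(h_arr, N * sizeof(float), d_arr, 0, 0,
                          N * sizeof(float), 1, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("h_arr[%i] = %f\n", i, h_arr[i]);

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(d_arr);
    return 0;
}
```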
Upvotes: 11