Reputation: 902
I am new to CUDA development and decided to start by writing small examples to understand how it works. Here is a kernel function I wrote that computes the squared Euclidean distance between the corresponding rows of two equal-sized matrices.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    int i;
    float squareEuclDist = 0; // accumulate in float to match the matrix data
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    //int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    if( r < rows ){ // take each row with var r (thread)
        for ( i = 0; i < cols; i++ ) // compute squared Euclid dist of each row
            squareEuclDist += ( A[r + rows*i] - B[r + rows*i] ) * ( A[r + rows*i] - B[r + rows*i] );
        C[r] = squareEuclDist;
        squareEuclDist = 0;
    }
}
The kernel initialization is done by
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
// numElements = 1500x200 (matrix size) ==> 1172 blocks/grid
and is called as
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
d_A and d_B are the input matrices, in this example of size 1500 x 200.
Question 1: I have read the basic theory of choosing the number of threads per block and blocks per grid, but something is still missing. I am trying to understand what the optimal launch parameters are for this simple kernel, and I am asking for a little help to start thinking the CUDA way.
Question 2: Are there any suggestions for improving the efficiency of the code? Can we use int c = blockDim.y * blockIdx.y + threadIdx.y to make things more parallel? Is shared memory applicable here?
Below, my GPU info is attached.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Question 3: Can the amount of global memory be expressed in terms of shared memory and the other types of memory the GPU has? Does the number of threads have anything to do with that?
Question 4: If the maximum number of threads per block is 512, how can the maximum sizes of each dimension of a block be 512 x 512 x 64 (= 16,777,216 threads)? How does this relate to the maximum sizes of each dimension of a grid?
Question 5: Using the memory clock rate, can we say how many threads are processed each second?
The for loop replaced with column threads
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols ){
    int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
    int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    float x=0;
    if(c < cols && r < rows){
        x = ( A[c + r*cols] - B[c + r*cols] ) * ( A[c + r*cols] - B[c + r*cols] );
    }
    C[r] = x;
}
Called with:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
Upvotes: 0
Views: 291
Reputation: 2179
OK, so there are a few things related to a kernel launch: the number of multiprocessors (associated with blocks) and the number of cores (associated with threads). Blocks are scheduled to run on a multiprocessor (you have 8 of them), and the threads of a block are scheduled to run on the cores of that single multiprocessor. Ideally you want enough blocks and threads that all your multiprocessors, and all the cores in each multiprocessor, are occupied. It is advisable to launch more blocks and threads than there are multiprocessors and cores, because the scheduler can then switch to other threads and blocks while some of them wait on memory.
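As a rough illustration of the "keep the multiprocessors occupied" point, the sketch below estimates how many threads can be resident on one multiprocessor for a given block size. It uses the 768 threads per multiprocessor from the device info in the question. The limit of 8 resident blocks per multiprocessor is the documented value for compute capability 1.x and is an assumption here, since it does not appear in the listing. The function name is hypothetical, and the estimate ignores register and shared memory usage, which can lower the result further.

```cuda
#include <algorithm>

// Hypothetical helper: upper bound on the threads resident on one
// multiprocessor, considering only the thread and block count limits.
// Registers and shared memory are deliberately ignored in this sketch.
int residentThreadsPerSM(int maxThreadsPerSM, int maxBlocksPerSM, int threadsPerBlock)
{
    // How many whole blocks fit under the thread limit.
    int blocksByThreads = maxThreadsPerSM / threadsPerBlock;
    // The hardware also caps the number of resident blocks.
    int blocks = std::min(blocksByThreads, maxBlocksPerSM);
    return blocks * threadsPerBlock;
}
```

With these assumed limits, residentThreadsPerSM(768, 8, 256) gives 768 (three full blocks), 512 threads per block gives 512 (only one block fits), and 32 threads per block gives 256 (the block cap is hit first). This suggests why block sizes such as 192 or 256 are a reasonable starting point on this card.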
Multiple dimensions make programming easier. For example, with 2D/3D images you can divide the image into sub-parts, give each sub-part to a different block, and then process each sub-image with multiple threads. It is more intuitive to use multiple dimensions (x, y, z) for indexing blocks and threads. In some cases extra dimensions also help when there is a limit on the number of blocks in one dimension (for example, with a large image you may hit the maximum number of blocks if you use only one dimension).
I am not sure I understand your third question, but I can say a bit about shared memory. Shared memory lives on a single multiprocessor and is shared by the cores of that multiprocessor. For you, the amount of shared memory is 16 KB. More recent GPUs (Fermi, Kepler) have 64 KB of on-chip memory per multiprocessor that is split between cache and shared memory: typically 16 KB is used as cache and the remaining 48 KB as shared memory, or you can choose a larger cache and a smaller shared memory. Shared memory is much faster than global memory, so if you have data that will be accessed frequently, it is wise to copy it into shared memory. Shared memory does not limit how many threads you can launch, although the amount each block uses affects how many blocks can be resident on a multiprocessor at once. Also, global memory and shared memory are separate.
The block dimension limits are per-dimension maxima: no single dimension may exceed its limit, and in addition the total number of threads in a block (x * y * z) cannot exceed 512. That total limit is 1024 on devices of compute capability 2.0 and later. Up to Fermi, each multiprocessor had at most 32 or 48 cores (8 on your card), so it did not make much sense to have more than 512 threads per block. The newer Kepler architecture has 192 cores per multiprocessor.
Threads are executed in warps: groups of 32 threads (the "Warp size" in your device info) that are executed together on the cores of a multiprocessor. If you assume that every access misses shared memory and goes to global memory, then from the number of cores per multiprocessor and the memory clock rate you could make a rough estimate of how many threads are processed each second. You would also need to take into account the number of instructions executed per thread, plus the time spent on register operations and so on.
I hope that answers your questions to some extent.
Upvotes: 2
Reputation: 9781
A1. Choosing the optimal number of threads per block is largely a matter of heuristics and experiment. You could try each candidate size in a loop such as
for(int threadsPerBlock=32; threadsPerBlock<=512;threadsPerBlock+=32){...}
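A minimal sketch of what the body of that loop might look like, timing each launch with CUDA events. The helper names blocksFor and sweepBlockSizes are hypothetical, the kernel is a float-accumulating version of the one in the question, and the grid is sized from rows (not rows x cols) on the assumption that one thread handles one row. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per row, column-major storage, as in the question's first kernel.
__global__ void cudaEuclid(const float* A, const float* B, float* C, int rows, int cols)
{
    int r = blockDim.x * blockIdx.x + threadIdx.x;
    if (r < rows) {
        float sum = 0.0f;
        for (int i = 0; i < cols; i++) {
            float d = A[r + rows * i] - B[r + rows * i];
            sum += d * d;
        }
        C[r] = sum;
    }
}

// Hypothetical helper: number of blocks needed to cover n threads.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: times one launch per block size from 32 to 512
// and returns the block size with the shortest measured time.
int sweepBlockSizes(const float* d_A, const float* d_B, float* d_C, int rows, int cols)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not charged to the first size.
    cudaEuclid<<<blocksFor(rows, 256), 256>>>(d_A, d_B, d_C, rows, cols);
    cudaDeviceSynchronize();

    int best = 0;
    float bestMs = 0.0f;
    for (int threadsPerBlock = 32; threadsPerBlock <= 512; threadsPerBlock += 32) {
        int blocksPerGrid = blocksFor(rows, threadsPerBlock);
        cudaEventRecord(start);
        cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, rows, cols);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block, %4d blocks: %.4f ms\n", threadsPerBlock, blocksPerGrid, ms);
        if (best == 0 || ms < bestMs) {
            best = threadsPerBlock;
            bestMs = ms;
        }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best;
}
```

For the 1500-row example, blocksFor(1500, 256) is 6, so only 6 blocks are needed when each thread handles a whole row. A single timed launch per size is noisy; averaging several launches would give a steadier picture.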
A2. Currently you use one thread per row and sum the elements into squareEuclDist sequentially. You could consider using one thread block per row instead. Within the block, each thread computes the squared difference of one element, and you can use a parallel reduction to sum them together. Please refer to the following link for parallel reduction.
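The one-block-per-row idea could be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: the matrices are assumed to be row-major (as in the question's second kernel), the block size THREADS is assumed to be a power of two, and the names rowSqDist and rowSquaredDistances are hypothetical. Each thread first accumulates a strided share of the row, then a shared memory tree reduction combines the partial sums. Error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128  // assumed block size; must be a power of two for the reduction

// One block per row: blockIdx.x selects the row, threads share its columns.
__global__ void rowSqDist(const float* A, const float* B, float* C, int cols)
{
    __shared__ float partial[THREADS];
    int r = blockIdx.x;
    int t = threadIdx.x;

    // Each thread sums the squared differences of columns t, t + blockDim.x, ...
    float sum = 0.0f;
    for (int c = t; c < cols; c += blockDim.x) {
        float d = A[r * cols + c] - B[r * cols + c];
        sum += d * d;
    }
    partial[t] = sum;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0)
        C[r] = partial[0];
}

// Hypothetical host wrapper: copies inputs to the device, launches one
// block per row, and returns the per-row squared distances.
std::vector<float> rowSquaredDistances(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       int rows, int cols)
{
    size_t bytes = (size_t)rows * cols * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, rows * sizeof(float));
    cudaMemcpy(d_A, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B.data(), bytes, cudaMemcpyHostToDevice);

    rowSqDist<<<rows, THREADS>>>(d_A, d_B, d_C, cols);

    std::vector<float> C(rows);
    cudaMemcpy(C.data(), d_C, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return C;
}
```

With row-major storage, consecutive threads read consecutive addresses within a row, which is the access pattern global memory handles best. For 1500 rows the grid of 1500 blocks is well within the 65535 limit from the device info.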
A3. The list you show gives the total amount of global and shared memory. Multiple threads share these hardware resources. You can find this tool in your CUDA installation directory to help you calculate exactly how much of each hardware resource is available per thread in a particular kernel.
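A4. "Maximum sizes of each dimension" does not mean that all dimensions of a block can reach their maximum at the same time: the product of the three block dimensions must still not exceed the maximum number of threads per block. However, there is no such limit on the total number of blocks per grid, so a grid of 65535x65535x1 blocks is possible.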
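To make the distinction concrete, here is a small sketch that checks a block shape against both kinds of limit. It relies on the cudaDeviceProp fields maxThreadsPerBlock, maxThreadsDim and maxGridSize from the CUDA runtime; the function names blockShapeIsValid and printLaunchLimits are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: a block shape is launchable only if every dimension
// is within its own maximum AND the total thread count is within the
// per-block maximum.
bool blockShapeIsValid(const cudaDeviceProp& p, dim3 block)
{
    bool dimsOk = block.x <= (unsigned)p.maxThreadsDim[0] &&
                  block.y <= (unsigned)p.maxThreadsDim[1] &&
                  block.z <= (unsigned)p.maxThreadsDim[2];
    unsigned long long total = (unsigned long long)block.x * block.y * block.z;
    return dimsOk && total <= (unsigned long long)p.maxThreadsPerBlock;
}

// Hypothetical helper: prints the launch limits reported by the runtime.
void printLaunchLimits(int device)
{
    cudaDeviceProp p;
    if (cudaGetDeviceProperties(&p, device) != cudaSuccess) {
        printf("could not query device %d\n", device);
        return;
    }
    printf("max threads per block: %d\n", p.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
}
```

With the limits posted in the question (512 threads per block, dimensions 512 x 512 x 64), shapes such as dim3(512, 1, 1) or dim3(16, 16, 2) pass the check, while dim3(32, 32, 1) fails because 1024 threads exceed the per-block total even though each dimension is within range.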
A5. The memory clock rate has nothing to do with the number of threads. You could read the programming model section of the CUDA documentation for more information.
Upvotes: 2