Reputation: 269
My understanding is that in CUDA, increase the number of blocks will not increase the time as they are implemented parallelly, but in my code, if I double the number of blocks, the time doubled as well.
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#define num_of_blocks 500
#define num_of_threads 512
__constant__ double y = 1.1;
// set seed for random number generator
__global__ void initcuRand(curandState* globalState, unsigned long seed){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
curand_init(seed, idx, 0, &globalState[idx]);
}
// kernel function for SIR
__global__ void test(curandState* globalState, double *dev_data){
// global threads id
int idx = threadIdx.x + blockIdx.x * blockDim.x;
// local threads id
int lidx = threadIdx.x;
// creat shared memory to store seeds
__shared__ curandState localState[num_of_threads];
// shared memory to store samples
__shared__ double sample[num_of_threads];
// copy global seed to local
localState[lidx] = globalState[idx];
__syncthreads();
sample[lidx] = y + curand_normal_double(&localState[lidx]);
if(lidx == 0){
// save the first sample to dev_data;
dev_data[blockIdx.x] = sample[0];
}
globalState[idx] = localState[lidx];
}
int main(){
// creat random number seeds;
curandState *globalState;
cudaMalloc((void**)&globalState, num_of_blocks*num_of_threads*sizeof(curandState));
initcuRand<<<num_of_blocks, num_of_threads>>>(globalState, 1);
double *dev_data;
cudaMalloc((double**)&dev_data, num_of_blocks*sizeof(double));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Start record
cudaEventRecord(start, 0);
test<<<num_of_blocks, num_of_threads>>>(globalState, dev_data);
// Stop event
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
// Clean up:
cudaEventDestroy(start);
cudaEventDestroy(stop);
std::cout << "Time ellapsed: " << elapsedTime << std::endl;
cudaFree(dev_data);
cudaFree(globalState);
return 0;
}
The test result is:
number of blocks: 500, Time ellapsed: 0.39136.
number of blocks: 1000, Time ellapsed: 0.618656.
So what is the reason that the time will increase? Is it because I access the constant memory or I copy the data from shared memory to global memory? Is that some ways to optimise it?
Upvotes: 0
Views: 1129
Reputation: 7245
While the number of blocks being able to run in parallel can be large, it is still finite due to limited on-chip resources. If the number of blocks requested in a kernel launch exceeds that limit, any further blocks have to wait for earlier blocks to finish and free up their resources.
One limited resource is shared memory, of which your kernel uses 28 kilobytes. CUDA 8.0 compatible Nvidia GPUs offer between 48 and 112 kilobytes of shared memory per streaming multiprocessor (SM), so that the maximum number of blocks running at any one time is between 1× and 3× the number of SMs on your GPU.
Other limited resources are registers and various per-warp resources in the scheduler. The CUDA occupancy calculator is a convenient Excel spreadsheet (also works with OpenOffice/LibreOffice) that shows you how these resources limit the number of blocks per SM for a specific kernel. Compile the kernel adding the option --ptxas-options="-v"
to the nvcc
command line, locate the line saying "ptxas info : Used XX registers, YY bytes smem, zz bytes cmem[0], ww bytes cmem[2]", and enter XX, YY, the number of threads per block you are trying to launch, and the compute capability of your GPU into the spreadsheet. It will then show the maximum number of blocks that can run in parallel on one SM.
You don't mention the GPU you have been running the test on, so I'll use a GTX 980 as an example. It has 16 SMs with 96Kb of shared memory each, so at most 16×3=48 blocks can run in parallel. Had you not used shared memory, the maximum number of resident warps would have limited the number of blocks per SM to 4, allowing 64 blocks to run in parallel.
On any currently existing Nvidia GPU, your example requires at least about a dozen waves of blocks executing sequentially, explaining why doubling the number of blocks will also about double the runtime.
Upvotes: 4