Fly_back

Reputation: 269

Why does increasing the number of blocks in CUDA increase the time?

My understanding is that in CUDA, increasing the number of blocks should not increase the elapsed time, as blocks run in parallel. But in my code, if I double the number of blocks, the time doubles as well.

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#define num_of_blocks 500
#define num_of_threads 512

__constant__ double y = 1.1;

// initialise the random number generator state for each thread
__global__ void initcuRand(curandState* globalState, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &globalState[idx]);
}

// kernel function for SIR
__global__ void test(curandState* globalState, double *dev_data){
    // global threads id
    int idx     = threadIdx.x + blockIdx.x * blockDim.x;

    // local threads id
    int lidx    = threadIdx.x;

    // shared memory to hold the per-thread generator states
    __shared__ curandState localState[num_of_threads];

    // shared memory to store samples
    __shared__ double sample[num_of_threads];

    // copy generator state from global to shared memory
    localState[lidx]    = globalState[idx];
    __syncthreads();

    sample[lidx]    =  y + curand_normal_double(&localState[lidx]);

    if(lidx == 0){
        // save the first sample to dev_data;
        dev_data[blockIdx.x] = sample[0];
    }

    globalState[idx]    = localState[lidx];
}

int main(){
    // create per-thread random number generator states
    curandState *globalState;
    cudaMalloc((void**)&globalState, num_of_blocks*num_of_threads*sizeof(curandState));
    initcuRand<<<num_of_blocks, num_of_threads>>>(globalState, 1);

    double *dev_data;
    cudaMalloc((void**)&dev_data, num_of_blocks*sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Start record
    cudaEventRecord(start, 0);

    test<<<num_of_blocks, num_of_threads>>>(globalState, dev_data);

    // Stop event
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    // Clean up:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    std::cout << "Time elapsed: " << elapsedTime << " ms" << std::endl;

    cudaFree(dev_data);
    cudaFree(globalState);
    return 0;
}

The test result is:

number of blocks: 500, Time elapsed: 0.39136 ms.
number of blocks: 1000, Time elapsed: 0.618656 ms.

So what is the reason the time increases? Is it because I access constant memory, or because I copy data from shared memory to global memory? Are there ways to optimise it?

Upvotes: 0

Views: 1129

Answers (1)

tera

Reputation: 7245

While the number of blocks that can run in parallel may be large, it is still finite due to limited on-chip resources. If a kernel launch requests more blocks than that limit, the remaining blocks have to wait for earlier blocks to finish and free up their resources.

One limited resource is shared memory, of which your kernel uses 28 kilobytes. CUDA 8.0 compatible Nvidia GPUs offer between 48 and 112 kilobytes of shared memory per streaming multiprocessor (SM), so that the maximum number of blocks running at any one time is between 1× and 3× the number of SMs on your GPU.
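
To see where those 28 kilobytes come from, and how much shared memory each SM on your own device offers, here is a minimal sketch (compile with nvcc; it assumes sizeof(curandState) is 48 bytes, as it is for the default XORWOW generator on current CUDA versions):

#include <curand_kernel.h>
#include <cstdio>

#define num_of_threads 512

int main(){
    // per-block shared memory of the test kernel above:
    // localState[num_of_threads] plus sample[num_of_threads]
    size_t smem = num_of_threads * sizeof(curandState)   // 512 * 48 B = 24 KB
                + num_of_threads * sizeof(double);       // 512 *  8 B =  4 KB
    printf("shared memory per block: %zu bytes\n", smem); // 28672 B = 28 KB

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("blocks per SM (smem limit): %zu\n", prop.sharedMemPerMultiprocessor / smem);
    return 0;
}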

Other limited resources are registers and various per-warp resources in the scheduler. The CUDA occupancy calculator is a convenient Excel spreadsheet (it also works with OpenOffice/LibreOffice) that shows how these resources limit the number of blocks per SM for a specific kernel. Compile the kernel with --ptxas-options="-v" added to the nvcc command line, locate the line saying "ptxas info : Used XX registers, YY bytes smem, ZZ bytes cmem[0], WW bytes cmem[2]", and enter XX, YY, the number of threads per block you are trying to launch, and the compute capability of your GPU into the spreadsheet. It will then show the maximum number of blocks that can run in parallel on one SM.
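
As a programmatic alternative to the spreadsheet, the runtime API provides cudaOccupancyMaxActiveBlocksPerMultiprocessor (available since CUDA 6.5), which accounts for the register, shared memory, and warp limits at once. A sketch, dropped into main() of your program after the kernel definitions:

    // how many blocks of test() fit on one SM at 512 threads per block,
    // with no dynamic shared memory
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, test, num_of_threads, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("blocks per SM: %d, blocks in flight on the device: %d\n",
           blocksPerSM, blocksPerSM * prop.multiProcessorCount);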

You don't mention the GPU you have been running the test on, so I'll use a GTX 980 as an example. It has 16 SMs with 96 KB of shared memory each, so at most 16 × 3 = 48 blocks can run in parallel. Had you not used shared memory, the maximum number of resident warps would have limited the number of blocks per SM to 4, allowing 64 blocks to run in parallel.
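
For reference, the warp-limit figure works out as follows (assuming a GTX 980, compute capability 5.2, which allows 64 resident warps per SM):

    const int warps_per_block  = 512 / 32;              // 16 warps per block
    const int blocks_per_SM    = 64 / warps_per_block;  // 64 resident warps -> 4 blocks
    const int blocks_in_flight = 16 * blocks_per_SM;    // 16 SMs -> 64 blocks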

On any currently existing Nvidia GPU, your example therefore requires at least about a dozen waves of blocks executing sequentially, which explains why doubling the number of blocks also roughly doubles the runtime.
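
Putting the numbers together for the GTX 980 example (a back-of-the-envelope sketch; 48 concurrent blocks is the shared-memory-limited figure from above):

    const int concurrent = 16 * 3;                               // 48 blocks in flight
    const int waves_500  = (500  + concurrent - 1) / concurrent; // ceil(500/48)  = 11
    const int waves_1000 = (1000 + concurrent - 1) / concurrent; // ceil(1000/48) = 21
    // 21 / 11 ≈ 1.9, consistent with the roughly doubled runtime measured above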

Upvotes: 4
