spiehr

Reputation: 411

Measure CUDA execution time with respect to number of blocks

I wrote my first CUDA kernel and am wondering how it is parallelized. I have the variables r0_dev and f_dev, arrays of length (*fnum_dev) * 3 each; each block reads them sequentially. Then there are r_dev, which I read, and v_dev, which I want to write to, in parallel; both are arrays of length gnum * 3.

The program produces the results I want it to produce, but the scaling of execution time with data size is not what I would have expected.

My expectation is that, as the size of the array v_dev increases, the execution time stays constant, as long as gnum is smaller than the maximum number of blocks allowed in a grid dimension.

Reality is different. I measured the time with the following code and observed linear scaling, which is what I would have expected from sequential code.

dim3 blockGrid(gnum);  // one block per grid point, a single thread per block

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// the actual calculation
stokeslets<<<blockGrid, 1>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev);

// time measurement
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);  // elapsed time in milliseconds
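For reference, a self-contained sweep over gnum could look like the following; the dummy kernel is just a placeholder standing in for stokeslets, and the gnum range is arbitrary:

#include <cstdio>

// Illustrative stand-in for stokeslets: one block per element, a single
// thread per block, mirroring the launch configuration above.
__global__ void dummy(float* v) {
    v[blockIdx.x] += 1.0f;
}

int main() {
    for (int gnum = 1024; gnum <= 16384; gnum *= 2) {
        float* v_dev;
        cudaMalloc(&v_dev, gnum * sizeof(float));
        cudaMemset(v_dev, 0, gnum * sizeof(float));

        cudaEvent_t start, stop;
        float time;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);

        dummy<<<gnum, 1>>>(v_dev);

        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);  // milliseconds
        printf("gnum = %6d: %.3f ms\n", gnum, time);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(v_dev);
    }
    return 0;
}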

Question:

Is my expectation, described above, wrong? What additional considerations are important?

Details

The following shows the implementation of stokeslets. Maybe I'm doing something wrong there?

__device__ void gridPoint(int offset, float* r0, float* f, float* r, float* v) {
    int flatInd = 3 * offset;

    float dr[3];
    float len = 0;
    float drf = 0;

    // accumulate the distance vector, its squared length, and f·dr
    for (int i = 0; i < 3; i++) {
        dr[i] = r[i] - r0[i + flatInd];
        len += dr[i] * dr[i];
        drf += dr[i] * f[i + flatInd];
    }

    len = sqrt(len);
    float fak = 1 / (8 * 3.1416 * 0.7);  // 1 / (8 * pi * mu) with viscosity mu = 0.7
    v[0] +=  (fak / len) * (f[0 + flatInd] + (dr[0]) * drf / (len * len));
    v[1] +=  (fak / len) * (f[1 + flatInd] + (dr[1]) * drf / (len * len));
    v[2] +=  (fak / len) * (f[2 + flatInd] + (dr[2]) * drf / (len * len));
}


__global__ void stokeslets(int* fnum, float* r0, float* f, float* r, float* v) {
    // where are we (which block, which is equivalent to the grid point)?
    int idx = blockIdx.x;

    // we want to add all force contributions
    float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};

    float vh[3] = {0, 0, 0};
    for (int i=0; i < *fnum; i++) {
        gridPoint(i, r0, f, rh, vh);
    }
    // add the accumulated velocity vh to the global result
    int flatInd = 3 * idx;
    v[0 + flatInd] += vh[0];
    v[1 + flatInd] += vh[1];
    v[2 + flatInd] += vh[2];
}

Upvotes: 1

Views: 224

Answers (1)

Vitality

Reputation: 21515

The main problem with your code is that you are launching many blocks that each contain only one thread.

Quoting the CUDA C Programming Guide:

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently.

Quoting the answer to the post How CUDA Blocks/Warps/Threads map onto CUDA Cores?

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units.

From the two quoted passages (in particular, that a multiprocessor is designed to execute hundreds of threads concurrently and that the two warp schedulers dispatch two warps per cycle), it follows that with one thread per block each warp has only a single active lane, so at most 2 threads do useful work per clock cycle per streaming multiprocessor. This is the main reason why you are observing essentially the same computational complexity as for the sequential case.

The recommendation is to rewrite your code/kernel so that each block runs multiple threads.
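A minimal sketch of what such a restructuring could look like, using one thread per grid point; the block size of 256 and the extra gnum parameter (so that out-of-range threads can exit early) are assumptions, and gridPoint is unchanged:

// Sketch: one thread per grid point instead of one block per grid point.
__global__ void stokesletsMT(int* fnum, float* r0, float* f, float* r,
                             float* v, int gnum) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx >= gnum) return;                          // guard the last block

    float rh[3] = {r[3 * idx + 0], r[3 * idx + 1], r[3 * idx + 2]};
    float vh[3] = {0, 0, 0};
    for (int i = 0; i < *fnum; i++) {
        gridPoint(i, r0, f, rh, vh);
    }

    int flatInd = 3 * idx;
    v[0 + flatInd] += vh[0];
    v[1 + flatInd] += vh[1];
    v[2 + flatInd] += vh[2];
}

// launch with enough 256-thread blocks to cover all gnum grid points
int threads = 256;
int blocks = (gnum + threads - 1) / threads;
stokesletsMT<<<blocks, threads>>>(fnum_dev, r0_dev, f_dev, r_dev, v_dev, gnum);

This way each warp has 32 active lanes and the SM schedulers have enough warps available to hide memory latency, so for gnum below the machine's capacity the execution time should stay roughly flat.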

Further reading: the Fermi architecture whitepaper.

Upvotes: 3
