brnk

Reputation: 307

CUDA 6D for loop computation

I want to parallelize the following 6D nested for loop in CUDA (Pascal architecture).

const int NX = 250, NY = 250, NZ = 250, NA = 100, NB = 100, NC = 100;
float data_out[NX * NY * NZ];
float data_in[NA * NB * NC];
float datax[NX];
float datay[NY];
float dataz[NZ];

for (int ix = 0; ix < NX; ix++)
{
    for (int iy = 0; iy < NY; iy++)
    {
        for (int iz = 0; iz < NZ; iz++)
        {
            float result = 0.0f;
            for (int ia = 0; ia < NA; ia++)
            {
                for (int ib = 0; ib < NB; ib++)
                {
                    for (int ic = 0; ic < NC; ic++)
                    {
                        // some exemplary computation (see kernel)
                    }
                }
            }
            data_out[iz + iy * NZ + ix * (NZ * NY)] = result;
        }
    }
}

Currently, I have implemented a kernel that performs the inner 3D nested for loop (loop variables ia, ib, ic); i.e., I don't use a parallel reduction so far. Each thread therefore computes the sum of NA * NB * NC = 1000000 values.

EDIT: The computation in the for loop was updated to an arbitrary nonlinear combination of the values, i.e., the values cannot be precomputed outside the loop.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void testKernel
(
    float *data_out,
    const float *data_in,
    const float *datax,
    const float *datay,
    const float *dataz,
    const int NX,
    const int NY,
    const int NZ,
    const int NA,
    const int NB,
    const int NC
)
{
    int ix = threadIdx.x + blockIdx.x*blockDim.x;
    int iy = threadIdx.y + blockIdx.y*blockDim.y;
    int iz = threadIdx.z + blockIdx.z*blockDim.z;

    if (ix >= NX || iy >= NY || iz >= NZ)
        return;

    float3 xyz = make_float3(datax[ix], datay[iy], dataz[iz]);
    float result = 0.0f;
    for (int ia = 0; ia < NA; ia++)
    {
        for (int ib = 0; ib < NB; ib++)
        {
            for (int ic = 0; ic < NC; ic++)
            {
                // some exemplary nonlinear computation to show memory access
                result += nonlinear_combination(data_in[ic + ib * NC + ia * (NC * NB)], xyz, ia, ib, ic);
            }
        }
    }
    data_out[iz + iy * NZ + ix * (NZ * NY)] = result;
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int NX = 250, NY = 250, NZ = 250, NA = 100, NB = 100, NC = 100;

    float *d_data_out, *d_data_in, *d_datax, *d_datay, *d_dataz;

    cudaMalloc((void**)&d_data_out, NX * NY * NZ * sizeof(float));
    cudaMalloc((void**)&d_data_in, NA * NB * NC * sizeof(float));
    cudaMalloc((void**)&d_datax, NX * sizeof(float));
    cudaMalloc((void**)&d_datay, NY * sizeof(float));
    cudaMalloc((void**)&d_dataz, NZ * sizeof(float));

    dim3 blockSize(8, 8, 8);
    dim3 gridSize(128, 128, 64);

    cudaEventRecord(start);
    testKernel<<<gridSize, blockSize>>>(d_data_out, d_data_in, d_datax, d_datay, d_dataz, NX, NY, NZ, NA, NB, NC);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    printf("Elapsed time: %.2f ms\n", milliseconds);

    cudaFree(d_data_out);
    cudaFree(d_data_in);
    cudaFree(d_datax);
    cudaFree(d_datay);
    cudaFree(d_dataz);

    return 0;
}

Is there any benefit to parallelizing the inner for loop as well, using a parallel reduction, given that the total number of iterations of the outer loops (NX * NY * NZ = 15625000) is already higher than the number of threads that can run in parallel?

Also, how can I optimize the memory access? It might be beneficial to ensure that the threads in a block access the same portion of the data and to copy that portion into shared memory first, right?

Upvotes: 1

Views: 95

Answers (1)

Mikhail M

Reputation: 960

Generally, your approach looks right. 15625000 threads is a lot even for the latest GPUs with ~10000 cores; for those, about 250000 threads is already enough to keep the device busy. However, your block/grid division wastes many of the launched threads: in x you launch 8 threads per block × 128 blocks = 1024 threads, far more than NX = 250, and likewise for y and z.

Also, CUDA will not let you run more than 1024 threads in one block. You could use a block size like (NX, 1, 1) and a grid size like (1, NY, NZ) to save some index computations; ideally, though, the number of threads per block should be divisible by 32 (the warp size).
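
For example (a sketch; 256 is just one reasonable multiple of 32, and the extra threads beyond NX = 250 are discarded by the kernel's existing bounds check), the launch configuration can be derived from the problem size instead of being hard-coded:

dim3 blockSize(256, 1, 1);  // threads per block, a multiple of the warp size
dim3 gridSize((NX + blockSize.x - 1) / blockSize.x,  // rounds NX up: 1 block for NX = 250
              NY,   // one block per iy
              NZ);  // one block per iz
testKernel<<<gridSize, blockSize>>>(d_data_out, d_data_in, d_datax, d_datay, d_dataz, NX, NY, NZ, NA, NB, NC);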

For coalesced memory access, make sure that neighbouring threads access neighbouring memory cells and (desirably) that each accessed block is aligned to roughly 64 bytes. The fastest-changing thread index is x, so, for example, the threads of the first warp have y = z = 0 and x = 0, ..., 31.
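
Applied to the kernel above, this matters: with data_out[iz + iy * NZ + ix * (NZ * NY)], consecutive x-threads write NZ * NY floats apart, so the writes are not coalesced. A sketch of one fix, keeping the question's memory layout, is to let threadIdx.x run over iz instead of ix:

// Swap the index mapping so that the fastest-changing thread index (x)
// covers the fastest-changing memory index (iz).
int iz = threadIdx.x + blockIdx.x * blockDim.x;
int iy = threadIdx.y + blockIdx.y * blockDim.y;
int ix = threadIdx.z + blockIdx.z * blockDim.z;

// ... bounds check and inner 3D loop over ia/ib/ic unchanged ...

data_out[iz + iy * NZ + ix * (NZ * NY)] = result;  // stride 1 across a warp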

You did the right thing by accumulating into a local variable and writing the result only once.

As for reducing the number of threads: that would only save the per-thread initialization work, i.e., the up-to-15625000 executions of int ix = threadIdx.x + blockIdx.x*blockDim.x; and the lines below it. Since your inner 3D loop is huge, this promises a very small gain.
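
If you nevertheless wanted fewer threads, the usual pattern is a grid-stride loop; a sketch (with the same placeholder inner loop as in the question):

// Each thread processes several output points, so the launched thread
// count can be chosen independently of NX * NY * NZ.
const int n = NX * NY * NZ;
for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
{
    int ix = i / (NZ * NY);
    int iy = (i / NZ) % NY;
    int iz = i % NZ;
    float result = 0.0f;
    // ... same inner 3D loop over ia/ib/ic as in the question ...
    data_out[i] = result;  // i == iz + iy * NZ + ix * (NZ * NY)
}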

And yes, some additional regrouping of the computation can make it possible not to read your data_in array so many times. Look at the classic example of tiled matrix multiplication on the GPU.
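
A sketch of that idea applied to the question's kernel (testKernelTiled and TILE = 256 are my arbitrary choices, nonlinear_combination is the question's placeholder, and I assume a launch like <<<dim3((NZ + 255) / 256, NY, NX), dim3(256, 1, 1)>>>): the threads of a block cooperatively stage a chunk of data_in in shared memory, so each element is read from global memory once per block instead of once per thread.

#define TILE 256  // elements of data_in staged in shared memory per step

__global__ void testKernelTiled
(
    float *data_out, const float *data_in,
    const float *datax, const float *datay, const float *dataz,
    const int NX, const int NY, const int NZ,
    const int NA, const int NB, const int NC
)
{
    __shared__ float tile[TILE];

    // threadIdx.x runs over iz, so the output writes are coalesced
    int iz = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = blockIdx.y;
    int ix = blockIdx.z;
    bool active = (ix < NX && iy < NY && iz < NZ);

    float3 xyz = make_float3(0.0f, 0.0f, 0.0f);
    if (active)
        xyz = make_float3(datax[ix], datay[iy], dataz[iz]);

    float result = 0.0f;
    const int total = NA * NB * NC;
    for (int base = 0; base < total; base += TILE)
    {
        // all threads of the block cooperate in the load, so each
        // data_in element is fetched from global memory once per block
        for (int t = threadIdx.x; t < TILE && base + t < total; t += blockDim.x)
            tile[t] = data_in[base + t];
        __syncthreads();

        if (active)
        {
            for (int t = 0; t < TILE && base + t < total; t++)
            {
                int idx = base + t;        // idx == ic + ib * NC + ia * (NC * NB)
                int ia = idx / (NC * NB);
                int ib = (idx / NC) % NB;
                int ic = idx % NC;
                result += nonlinear_combination(tile[t], xyz, ia, ib, ic);
            }
        }
        __syncthreads();
    }

    if (active)
        data_out[iz + iy * NZ + ix * (NZ * NY)] = result;
}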

I would also try loop unrolling, after ensuring coalesced and minimal memory access, of course (though you may not need explicit shared memory, since the GPU automatically uses each SM's memory as a cache). Once you have a first working version, Nsight Compute will give you many insights into the effectiveness of your code.
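
For the unrolling, a hint on the innermost loop is a cheap first experiment (a sketch; whether it actually helps is exactly what Nsight Compute will tell you):

#pragma unroll 4  // ask nvcc to unroll the innermost loop four times
for (int ic = 0; ic < NC; ic++)
{
    result += nonlinear_combination(data_in[ic + ib * NC + ia * (NC * NB)], xyz, ia, ib, ic);
}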

Upvotes: 1
