Vinc37

Reputation: 43

Some CUDA computations fail with larger block dimensions (still < 1024 threads per block)

I am learning CUDA with a GTX 960 4GB. I wrote a program which performs an element-wise matrix multiplication. When I increase the block dimensions for x and y to, let's say, (32 x 32) in combination with a large matrix (let's say 15000 x 15000 elements), some but not all multiplication results are wrong (value 0 instead of 6).

When I then decrease the block dimensions to e.g. (8 x 8), all results are correct again. When I decrease the matrix size, the results are also correct.

So in this example, there seem to be combinations of total thread count and threads per block that do not work.

I am surprised that I can't find any threads on this topic. Everything I can find is about increasing performance and occupancy, but nothing about cases where some but not all calculations fail.

The grid dimensions are calculated as follows:

dim3 blocks(ceil<int>(COLS / threads.x), ceil<int>(ROWS / threads.y));

Why do some multiplications fail while others are successful?

Some Examples

Block dim    : (8, 8)
Matrix shape : (15000, 15000)
Verification : 0 elements have failed, total length 225000000, shape: (15000, 15000).

Block dim    : (16, 16)
Matrix shape : (15000, 15000)
Verification : 239936 elements have failed, total length 225000000, shape: (15000, 15000).

Block dim    : (32, 32)
Matrix shape : (15000, 15000)
Verification : 719424 elements have failed, total length 225000000, shape: (15000, 15000).

Block dim    : (32, 32)
Matrix shape : (10000, 10000)
Verification : 0 elements have failed, total length 100000000, shape: (10000, 10000).

Driver Version

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.82.00  Thu Oct 14 10:24:40 UTC 2021

Complete Code

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define ROWS 10000
#define COLS 10000
#define MAX_ERR 1e-6

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

size_t ij(int i, int j){
    return j * ROWS + i;
}


__global__ void matrix_multi_elemwise(const Matrix OUT, const Matrix A, const Matrix B) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < A.width && row < A.height) {
        int index = row * A.height + col;  // linearisation of index
        OUT.elements[index] = A.elements[index] * B.elements[index];
    }
}

int main(){
    Matrix A, B, OUT;
    Matrix dev_A, dev_B, dev_OUT; 

    size_t SIZE = ROWS * COLS * sizeof(float);

    // Allocate host memory
    A.elements = (float*) malloc(SIZE);
    B.elements = (float*) malloc(SIZE);
    OUT.elements = (float*) malloc(SIZE);

    // Initialize host matrices
    A.height = ROWS; A.width = COLS;
    B.height = ROWS; B.width = COLS;
    OUT.height = ROWS; OUT.width = COLS;

    for (int j = 0; j < ROWS; j++) {
        for(int i = 0; i < COLS; i++){
            A.elements[ij(i, j)] = 2.0f;
            B.elements[ij(i, j)] = 3.0f;
        }
    }

    // Allocate device memory
    cudaMalloc((void**) &dev_A.elements, SIZE);
    cudaMalloc((void**) &dev_B.elements, SIZE);
    cudaMalloc((void**) &dev_OUT.elements, SIZE);

    dev_A.height = A.height; dev_A.width = A.width;
    dev_B.height = A.height; dev_B.width = B.width;
    dev_OUT.height = A.height; dev_OUT.width = OUT.width;

    // Transfer data from host to device memory
    cudaMemcpy(dev_A.elements, A.elements, SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B.elements, B.elements, SIZE, cudaMemcpyHostToDevice);

    // Executing kernel 
    dim3 threads(16, 16);
    dim3 blocks(ceil<int>(COLS / threads.x), ceil<int>(ROWS / threads.y));
    matrix_multi_elemwise<<<blocks, threads>>>(dev_OUT, dev_A, dev_B);

    cudaError_t err = cudaGetLastError();
    if(err != cudaSuccess) {
        printf("CUDA Runtime API Error reported : %s in file %s on line.\n", cudaGetErrorString(err), __FILE__);
    }
    
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();  
    
    // Transfer data back to host memory
    cudaMemcpy(OUT.elements, dev_OUT.elements, SIZE, cudaMemcpyDeviceToHost);

    // Verification
    int count = 0, length = 0, i = 0, j = 0;
    for (j = 0; j < ROWS; j++) {
        for(i = 0; i < COLS; i++){
            //assert(fabs(OUT.elements[ij(i, j)] / A.elements[ij(i, j)] - B.elements[ij(i, j)]) < MAX_ERR);
            if (fabs(OUT.elements[ij(i, j)] / A.elements[ij(i, j)] - B.elements[ij(i, j)]) > MAX_ERR) {
                count++;
            }
            length++;
        }
    }
    printf("Verification: %i elements have failed, total length %i, shape: (%i, %i).\n", count, length, i, j);

    // Deallocate device memory
    cudaFree(dev_A.elements);
    cudaFree(dev_B.elements);
    cudaFree(dev_OUT.elements);
    
    // Deallocate host memory
    free(A.elements); 
    free(B.elements); 
    free(OUT.elements);
}

Upvotes: 1

Views: 156

Answers (1)

J&#233;r&#244;me Richard
J&#233;r&#244;me Richard

Reputation: 50278

The number of blocks is wrong. Indeed, COLS and threads.x are both integers, so COLS / threads.x is a truncated integer division. ceil<int> cannot round the result up because it has already been truncated. As a result, some blocks are simply never launched: 15000 is divisible by 8, but not by 16 or 32. For example, with (16, 16) blocks only 15000 / 16 = 937 blocks are launched per dimension, covering 937 * 16 = 14992 rows and columns; the remaining 15000*15000 - 14992*14992 = 239936 elements are never computed, which matches the reported failure count. You need to either cast COLS to a floating-point number or compute the ceiling manually with integer arithmetic (safer). Here is an example:

dim3 blocks((COLS + threads.x - 1) / threads.x, (ROWS + threads.y - 1) / threads.y);
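
If you prefer to factor this out, a small helper makes the intent explicit. A minimal sketch, assuming a helper named div_ceil (an illustrative name, not part of the CUDA API):

// Round an integer division up without going through floating point.
// This also avoids the precision loss a float cast could introduce for
// very large dimensions (float only has 24 bits of mantissa).
static inline unsigned int div_ceil(unsigned int a, unsigned int b) {
    return (a + b - 1) / b;
}

// Usage:
dim3 threads(16, 16);
dim3 blocks(div_ceil(COLS, threads.x), div_ceil(ROWS, threads.y));
// 15000 / 16 truncates to 937, while div_ceil(15000, 16) yields 938.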

As pointed out in the comments, note that row * A.height + col is wrong: it should be row * A.width + col instead, since in row-major storage consecutive rows are separated by the row length (the number of columns). It happens to work here because the matrices are square, but it would cause issues for non-square matrices.
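
For reference, here is what the corrected kernel could look like (a sketch of the question's kernel with only the indexing fixed):

__global__ void matrix_multi_elemwise(const Matrix OUT, const Matrix A, const Matrix B) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < A.width && row < A.height) {
        // Row-major storage: moving down one row skips A.width elements.
        int index = row * A.width + col;
        OUT.elements[index] = A.elements[index] * B.elements[index];
    }
}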

Upvotes: 3
