n0n4m3

Reputation: 41

cuda big-matrix and blocks/threads

I'm having some issues with handling big matrices. As explained in another question of mine, I have a program that works on big square matrices (around 5k-10k). The computational part is correct (though not yet 100% optimized), and I've tested it with smaller square matrices (around 256-512). Here is my code:

#define N 10000
#define RADIUS 100
#define SQRADIUS (RADIUS*RADIUS) //parenthesized so the macro expands safely inside expressions
#define THREADS 512

//many of these device functions are declared
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) {
    unsigned int sum = 0, c = 0;

    //some work with radius and stuff

    return sum;
}

__global__ void applyAvg(const unsigned char *src, unsigned char *dest) {
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0;
    unsigned int stride = blockDim.x * gridDim.x;
    int col = tid % N, row = tid / N;

    while(tid < N*N) {
        if(row * col < N * N) {
            //choose which of the __device__ functions needs to be launched
        }

        tid += stride;
        col = tid % N, row = tid / N;
    }
    __syncthreads();
}

int main( void ) {
    cudaError_t err;
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg;
    int i, j;

    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char));

    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 1"); exit(-1);}
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 2"); exit(-1);}
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 3"); exit(-1);}

    for(i = 0; i < N * N; i++) {
        base[i] = (unsigned char)(rand() % 256);
    }

    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if(err != cudaSuccess){printf("ERROR 4"); exit(-1);}

    //more 'light' stuff to do before the 'heavy computation'

    applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>>(d_thresh, d_avg);

    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) {printf("ERROR 5"); exit(-1);}
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) {printf("ERROR 6"); exit(-1);}

    getchar();
    return 0;
}

When launching the program on a big matrix (like 10000 x 10000) with a radius of 100 (which is how 'far' from every point in the matrix I look), it takes a very long time.

I believe the problem lies both in applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>> (how many blocks and threads I decide to run) and in the applyAvg(...) kernel (the stride and the tid). Can someone clarify the best way to decide how many blocks/threads to launch, given that the matrix size can vary from 5k to 10k?

Upvotes: 0

Views: 385

Answers (1)

kangshiyin

Reputation: 9789

I suppose what you want to do is image filtering/convolution. Based on your current CUDA kernel, there are two things you could do to improve performance.

  1. Use 2-D threads/blocks to avoid the / and % operators; integer division and modulo are expensive on the GPU (see the 2-D indexing sketch just below this list).

  2. Use shared memory to reduce global memory bandwidth (see the row-pass sketch after the white paper link further down).
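For point 1, here is a minimal sketch of a 2-D launch over your N x N matrix (the applyAvg2D name and the 16 x 16 block shape are my assumptions, not your code). Each thread gets its row and column directly from the 2-D built-in indices, so no division or modulo is needed:

__global__ void applyAvg2D(const unsigned char *src, unsigned char *dest) {
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;

    if(row < N && col < N) {
        //dispatch to the same __device__ functions as before
        dest[row * N + col] = avg(src, row, col);
    }
}

//launch: one thread per matrix element, rounding each dimension up
dim3 blocks(16, 16);
dim3 grid((N + blocks.x - 1) / blocks.x, (N + blocks.y - 1) / blocks.y);
applyAvg2D<<<grid, blocks>>>(d_thresh, d_avg);

With N = 10000 this gives a 625 x 625 grid of 16 x 16 blocks, which is well within device limits, and the grid-stride bookkeeping disappears entirely.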

Here's a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA.

http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
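As a rough sketch of the separable approach from that paper (a box filter factors into a row pass followed by a column pass): each block stages its strip of a row, plus a RADIUS-wide halo on each side, in shared memory once, so the 2*RADIUS+1 reads per output come from fast shared memory instead of global memory. TILE_W, the boxRow name, and the zero padding at the image borders are my assumptions:

#define TILE_W 128 //threads per block, assumed tile width

//row pass only; a symmetric column pass over its output completes the filter
__global__ void boxRow(const unsigned char *src, unsigned char *dest) {
    __shared__ unsigned char tile[TILE_W + 2 * RADIUS];
    int row = blockIdx.y;
    int col = blockIdx.x * TILE_W + threadIdx.x;

    //cooperatively load this block's pixels plus the left/right halo
    for(int i = threadIdx.x; i < TILE_W + 2 * RADIUS; i += blockDim.x) {
        int c = blockIdx.x * TILE_W - RADIUS + i;
        tile[i] = (c >= 0 && c < N) ? src[row * N + c] : 0; //zero-pad outside the image
    }
    __syncthreads();

    if(col < N) {
        unsigned int sum = 0;
        for(int k = -RADIUS; k <= RADIUS; k++)
            sum += tile[threadIdx.x + RADIUS + k];
        dest[row * N + col] = (unsigned char)(sum / (2 * RADIUS + 1));
    }
}

//launch: one block per TILE_W-pixel strip of each row
//dim3 grid((N + TILE_W - 1) / TILE_W, N);
//boxRow<<<grid, TILE_W>>>(d_thresh, d_avg);

With RADIUS = 100, each output needs 201 inputs per pass, so the two-pass version does about 402 shared-memory reads per pixel instead of the roughly 40000 global reads a naive 2-D window would need.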

NVIDIA's NPP library also provides the box-filter function nppiFilterBox(), so you don't need to write your own kernel. Here are the documentation and an example.

http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp

NPP documentation, p. 1009: http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
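If you go the NPP route, a hypothetical call for the 8-bit single-channel variant nppiFilterBox_8u_C1R might look like the following (restricting the ROI to the interior of the image is my choice, so the 201 x 201 mask never reads out of bounds; check the linked documentation for the exact border semantics):

#include <npp.h>

NppiSize roi = { N - 2 * RADIUS, N - 2 * RADIUS }; //interior region only
NppiSize mask = { 2 * RADIUS + 1, 2 * RADIUS + 1 }; //averaging window
NppiPoint anchor = { RADIUS, RADIUS }; //center the window on each pixel

NppStatus st = nppiFilterBox_8u_C1R(
    d_thresh + RADIUS * N + RADIUS, N, //src shifted to the interior, row step in bytes
    d_avg + RADIUS * N + RADIUS, N,    //dst shifted the same way
    roi, mask, anchor);
if(st != NPP_SUCCESS) { printf("NPP ERROR"); exit(-1); }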

Upvotes: 1
