Reputation: 41
I'm having some issues handling big matrices. As explained in this other question, I have a program that works on big square matrices (around 5k-10k per side). The computational part is correct (though still not 100% optimized) and I've tested it with smaller square matrices (around 256-512). Here is my code:
#define N 10000
#define RADIUS 100
#define SQRADIUS RADIUS*RADIUS
#define THREADS 512

//many of these device functions are declared
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) {
    unsigned int sum = 0, c = 0;
    //some work with radius and stuff
    return sum;
}

__global__ void applyAvg(const unsigned char *src, unsigned char *dest) {
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0;
    unsigned int stride = blockDim.x * gridDim.x;
    int col = tid % N, row = (int)tid / N;
    while(tid < N * N) {
        if(row * col < N * N) {
            //choose which of the __device__ functions needs to be launched
        }
        tid += stride;
        col = tid % N, row = (int)tid / N;
    }
    __syncthreads();
}
int main(void) {
    cudaError_t err;
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg;
    int i, j;

    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char));

    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) { printf("ERROR 1"); exit(-1); }
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) { printf("ERROR 2"); exit(-1); }
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) { printf("ERROR 3"); exit(-1); }

    for(i = 0; i < N * N; i++) {
        base[i] = (unsigned char)(rand() % 256);
    }

    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if(err != cudaSuccess) { printf("ERROR 4"); exit(-1); }

    //more 'light' stuff to do before the 'heavy computation'

    applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>>(d_thresh, d_avg);

    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) { printf("ERROR 5"); exit(-1); }
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) { printf("ERROR 6"); exit(-1); }

    getchar();
    return 0;
}
When launching this with a big matrix (e.g. 10000 x 10000) and a radius of 100 (which is how 'far' from every point in the matrix I look), it takes a very long time.
I believe the problem lies both in the launch configuration applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>>
(how many blocks and threads I decide to run) and in the applyAvg(...)
kernel itself (the stride and the tid).
Can someone clarify the best way to decide how many blocks/threads to launch, given that the matrix side can vary from 5k to 10k?
Upvotes: 0
Views: 385
Reputation: 9789
I suppose what you want to do is image filtering/convolution. Based on your current CUDA kernel, there are two things you can do to improve performance:

1. Use 2-D thread blocks and a 2-D grid to avoid the / and % operators. They are very slow.
2. Use shared memory to reduce global memory bandwidth.
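A minimal sketch of the first point, with 2-D indexing so row and col come straight from the built-in indices (the averaging body is omitted, as in your original):

```cuda
#define N 10000

// 2-D grid/block: no integer division or modulo per element.
__global__ void applyAvg2D(const unsigned char *src, unsigned char *dest) {
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        // ... choose and run the per-pixel __device__ function here ...
        dest[row * N + col] = src[row * N + col];
    }
}

// Launch: 16x16 = 256 threads per block, grid covers the whole matrix.
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
// applyAvg2D<<<grid, block>>>(d_src, d_dest);
```

This also launches enough blocks to cover all N*N elements, so no grid-stride loop is needed.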
Here's a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
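The core idea from that paper, sketched very roughly (tile and radius sizes are illustrative; the paper handles large radii with separable passes): each block stages its tile plus a halo into shared memory once, and all neighbor reads during averaging then hit fast shared memory instead of global memory.

```cuda
#define TILE 16
#define R 2   // illustrative small radius; RADIUS = 100 needs a staged approach

// Launch with blockDim = (TILE + 2*R, TILE + 2*R) so every thread loads one
// element of the (tile + halo) region; only interior threads write output.
__global__ void boxFilterShared(const unsigned char *src, unsigned char *dest,
                                int width, int height) {
    __shared__ unsigned char tile[TILE + 2 * R][TILE + 2 * R];

    int gx = blockIdx.x * TILE + threadIdx.x - R;  // global coords incl. halo
    int gy = blockIdx.y * TILE + threadIdx.y - R;

    // Clamp to the image border and stage into shared memory.
    int cx = min(max(gx, 0), width - 1);
    int cy = min(max(gy, 0), height - 1);
    tile[threadIdx.y][threadIdx.x] = src[cy * width + cx];
    __syncthreads();

    // Interior threads (not the halo loaders) compute the box average.
    if (threadIdx.x >= R && threadIdx.x < TILE + R &&
        threadIdx.y >= R && threadIdx.y < TILE + R &&
        gx < width && gy < height) {
        unsigned int sum = 0;
        for (int dy = -R; dy <= R; dy++)
            for (int dx = -R; dx <= R; dx++)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        dest[gy * width + gx] =
            (unsigned char)(sum / ((2 * R + 1) * (2 * R + 1)));
    }
}
```

Each input byte is read from global memory once per block instead of once per neighboring output pixel, which is the bandwidth saving the paper quantifies.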
The NVIDIA NPP library also provides the function nppiFilterBox(), so you don't need to write your own kernel. Here are the documentation and an example.
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
NPP doc pp.1009 http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
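A hedged sketch of calling the NPP box filter on an 8-bit single-channel image, assuming the nppiFilterBox_8u_C1R signature from the NPP docs (border handling is ignored here; real code should shrink the ROI or offset the source pointer so the mask never reads outside the image):

```cuda
#include <nppi.h>

// d_src and d_dst are assumed to be device pointers to n x n Npp8u images.
void runBoxFilter(const Npp8u *d_src, Npp8u *d_dst, int n, int radius) {
    NppiSize roi = { n, n };                            // region to filter
    NppiSize mask = { 2 * radius + 1, 2 * radius + 1 }; // box side length
    NppiPoint anchor = { radius, radius };              // center the box

    nppiFilterBox_8u_C1R(d_src, n * sizeof(Npp8u),
                         d_dst, n * sizeof(Npp8u),
                         roi, mask, anchor);
}
```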
Upvotes: 1