Reputation: 417
I am writing a CUDA program for matching each input image of resolution ~180x180 against about 10,000 template images of resolution ~128x128. The goal is to achieve realtime performance, i.e. template matching of 25~30 input images (each against all 10,000 templates) in 1 second.
Currently I am using the following approach. So far, for 1 input image against all 10,000 templates, it is taking around 2 seconds.
My questions are:

1. Is there a way to determine if this task is achievable in realtime or not? I mean with the help of maximum FLOPS and I/O bandwidth limitations etc.
2. How to compute if the GPU is being fully utilized at its maximum?
3. Possible ways to improve the performance?
Machine specs: [i7-4770, 8GB, GTX-680]
Explanation of current kernel code:
Kernel Code:
__global__ void cudaMatchTemplate(TemplateArray *templates, uchar *Match)
{
    int global = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int   idx[TEMPLATE_MATCH_DIM];
    __shared__ float out_shared[TEMPLATE_MATCH_DIM];

    // halving the template size....
    int rows    = (templates[blockIdx.x].nHeight) / 2;
    int cols    = (templates[blockIdx.x].nWidth) / 2;
    int fullCol = templates[blockIdx.x].nWidth;
    int x       = templates[blockIdx.x].nMatchLeft;
    int y       = templates[blockIdx.x].nMatchTop;

    int offset_y = (threadIdx.x / TEMPLATE_MATCH_SIZE);
    int offset_x = (threadIdx.x - offset_y * TEMPLATE_MATCH_SIZE);

    // *************** Performing match in time domain *****************************//
    int sum = 0;
    float temp;
    int idxXFactor = 3 * (2 * (offset_x) + x);
    int idxYFactor = 2 * (offset_y) + y;

    for (int i = 0; i < rows; i++)
    {
        int I = 3 * i * fullCol;
        int sourceIdxY = idxYFactor + 2 * i;
        for (int j = 0; j < cols; j++)
        {
            int J = 3 * j;
            int sourceIdxX  = idxXFactor + 2 * J;
            int templateIdx = 2 * I + 2 * J;
            //**** R *****//
            temp = float(tex2D(SourceImgColorTex, sourceIdxX, sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx]);
            sum  = sum + temp * temp;
            //**** G *****//
            temp = float(tex2D(SourceImgColorTex, sourceIdxX + 1, sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx + 1]);
            sum  = sum + temp * temp;
            //**** B *****//
            temp = float(tex2D(SourceImgColorTex, sourceIdxX + 2, sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx + 2]);
            sum  = sum + temp * temp;
        }
    }

    __syncthreads();

    // placing all values in shared memory for comparison.
    if (threadIdx.x < TEMPLATE_MATCH_DIM)
    {
        idx[threadIdx.x]        = threadIdx.x;
        out_shared[threadIdx.x] = sum;
    }
    __syncthreads();

    // computing the Min location.....//
    #pragma unroll
    for (int s = 512; s > 0; s >>= 1)
    {
        if ((threadIdx.x < s) && ((threadIdx.x + s) < TEMPLATE_MATCH_DIM))
        {
            idx[threadIdx.x]        = out_shared[threadIdx.x] < out_shared[threadIdx.x + s] ? idx[threadIdx.x] : idx[threadIdx.x + s];
            out_shared[threadIdx.x] = out_shared[threadIdx.x] < out_shared[threadIdx.x + s] ? out_shared[threadIdx.x] : out_shared[threadIdx.x + s];
        }
    }
    __syncthreads();

    if (threadIdx.x < 1)
    {
        int half_Margin = MARGIN_FOR_TEMPLATE_MATCH / 2;
        int matchY = idx[0] / TEMPLATE_MATCH_SIZE;
        int matchX = idx[0] - matchY * TEMPLATE_MATCH_SIZE;
        int diff = absolute(half_Margin - matchX) + absolute(half_Margin - matchY);
        if (diff < THRESHOLD)
        {
            Match[blockIdx.x] = 1;
        }
        else
            Match[blockIdx.x] = 0;
    }
}
Upvotes: 1
Views: 814
Reputation: 1246
I'll try to answer most of your questions.
Is there is way to determine if this task is achievable in realtime or not? I mean with the help of maximum FLOPS and I/O bandwidth limitations etc.
I have no idea how to determine whether the kernel is achievable in real time, but you can maximize your CUDA kernel using the CUDA Occupancy Calculator. You may also consider using texture memory, surface memory, constant memory, pinned host memory and more. Those choices depend on your algorithm implementation.
How to compute if the GPU is being fully utilized at its maximum?
You can use the CUDA Occupancy Calculator and the CUDA Visual Profiler. I'd highly recommend using the Visual Profiler; it will guide you through understanding CUDA.
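Besides the spreadsheet-based Occupancy Calculator, newer CUDA toolkits expose the same calculation through a runtime API. A minimal host-side sketch (assuming your kernel is named `cudaMatchTemplate` and a block size of 1024, as in the question; the API may not exist in older toolkits):

```cuda
// Query theoretical occupancy for the matching kernel (hypothetical setup).
int numBlocksPerSm = 0;
int blockSize = 1024;   // threads per block -- an assumption from the question
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm,
                                              cudaMatchTemplate,
                                              blockSize,
                                              0 /* dynamic shared memory */);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (numBlocksPerSm * blockSize) /
                  float(prop.maxThreadsPerMultiProcessor);
printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
```

This only reports theoretical occupancy; the Visual Profiler measures achieved occupancy and memory throughput, which matter more for a bandwidth-heavy kernel like this one.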
Possible ways to improve the performance?
There are several interesting methods for doing so. First, you can maximize your kernel using the methods above. If that's not enough, try implementing a pipeline using stream objects in order to overlap the data copies with the computation.
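A rough sketch of such a pipeline, with two streams alternating between uploading the next input image and matching the current one. The kernel launch signature, buffer names and chunking are assumptions for illustration (the question's kernel actually reads the source image via a texture, so the upload step would bind/copy into that texture instead); host buffers must be pinned (`cudaMallocHost`) for `cudaMemcpyAsync` to overlap with kernel execution:

```cuda
// Two-stream copy/compute pipeline (illustrative names and sizes).
const int nStreams = 2;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; i++)
    cudaStreamCreate(&streams[i]);

for (int img = 0; img < numImages; img++)
{
    cudaStream_t s = streams[img % nStreams];
    // upload the next input image while the previous one is being matched
    cudaMemcpyAsync(d_img[img % nStreams], h_img[img], imgBytes,
                    cudaMemcpyHostToDevice, s);
    cudaMatchTemplate<<<numTemplates, blockSize, 0, s>>>(d_templates,
                                                         d_match[img % nStreams]);
    // download the per-template match flags for this image
    cudaMemcpyAsync(h_match[img], d_match[img % nStreams], numTemplates,
                    cudaMemcpyDeviceToHost, s);
}
cudaDeviceSynchronize();

for (int i = 0; i < nStreams; i++)
    cudaStreamDestroy(streams[i]);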
If that's not going to work out, try working on latency: have multiple host threads access the GPU at the same time. Since CC 3.5, CUDA devices have HyperQ, which may help you complete several calls in parallel.
If that's not going to work, consider using multiple GPU devices.
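If you go that route, the templates can be split across devices so each GPU matches a fraction of the 10,000. A minimal sketch (per-device allocation and upload elided; all names are illustrative, not from the question's code):

```cuda
// Split the template set across all available GPUs (illustrative).
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
int chunk = numTemplates / deviceCount;   // assume it divides evenly

for (int dev = 0; dev < deviceCount; dev++)
{
    cudaSetDevice(dev);
    // d_templates[dev] / d_match[dev] hold this device's share of templates
    cudaMatchTemplate<<<chunk, blockSize>>>(d_templates[dev], d_match[dev]);
}
// then switch to each device in turn to synchronize and copy results back
```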
Upvotes: 1