Reputation: 69
I have this piece of C++ code and I want to port it to CUDA.
```cpp
int p, f, i, tmp;       // loop indices and best-bin index
float dist, minDist;

for (int im = 0; im < numImages; im++)
{
    for (p = 0; p < xsize * ysize; p++)
    {
        bool ok = false;
        for (f = 0; f < numFeatures; f++)
        {
            if (feature[im][f][p] != 0)
            {
                ok = true;
                break;
            }
        }
        if (ok)
        {
            minDist = 1e9;
            for (i = 0; i < numBins; i++)
            {
                dist = 0;
                for (f = 0; f < numFeatures; f++)
                {
                    dist += (float)((feature[im][f][p] - clusterPoint[f][i]) *
                                    (feature[im][f][p] - clusterPoint[f][i]));
                }
                if (dist < minDist)
                {
                    minDist = dist;
                    tmp = i;
                }
            } // end for i
            for (f = 0; f < numFeatures; f++)
                csum[f][tmp] += feature[im][f][p];
            ccount[tmp]++;
            averageDist[tmp] += sqrt(minDist);
        } // end if (ok)
    } // end for p
} // end for im
```
I want to compute csum, ccount and averageDist on the GPU. csum and averageDist are floats; ccount is an integer. Is this a parallel reduction problem?
Upvotes: 0
Views: 260
Reputation: 1507
I didn't fully understand what your code is supposed to do, and I also don't know the approximate values of numBins and numFeatures. Nevertheless, I would parallelize this loop: `for (p = 0; p < xsize*ysize; p++)`, so that each thread computes its own values and stores them in a global array. Having these arrays of features and distances, you can then compute csum, ccount and averageDist using a standard parallel reduction.

The main loop over images, `for (int im = 0; im < numImages; im++)`, can be handled either by launching the kernel repeatedly, or by parallelizing it together with the loop over pixels.
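A minimal sketch of such a kernel, assuming the arrays are flattened to 1-D on the device and the feature values are floats (the names and memory layout here are my assumptions, not from the question). For brevity it accumulates csum, ccount and averageDist directly with atomicAdd rather than writing per-thread results out for a separate reduction pass; for large pixel counts the two-phase approach with a proper parallel reduction is usually faster:

```cuda
// feature      : [numImages][numFeatures][numPixels], flattened (assumed layout)
// clusterPoint : [numFeatures][numBins], flattened
// csum         : [numFeatures][numBins], ccount : [numBins], averageDist : [numBins]
__global__ void assignClusters(const float *feature, const float *clusterPoint,
                               float *csum, int *ccount, float *averageDist,
                               int numFeatures, int numBins, int numPixels)
{
    int p  = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pixel
    int im = blockIdx.y;                             // one grid row per image
    if (p >= numPixels) return;

    const float *feat = feature + (size_t)im * numFeatures * numPixels;

    // Skip pixels with all-zero features, as in the original code.
    bool ok = false;
    for (int f = 0; f < numFeatures; f++)
        if (feat[f * numPixels + p] != 0.0f) { ok = true; break; }
    if (!ok) return;

    // Find the nearest bin for this pixel.
    float minDist = 1e9f;
    int tmp = 0;
    for (int i = 0; i < numBins; i++) {
        float dist = 0.0f;
        for (int f = 0; f < numFeatures; f++) {
            float d = feat[f * numPixels + p] - clusterPoint[f * numBins + i];
            dist += d * d;
        }
        if (dist < minDist) { minDist = dist; tmp = i; }
    }

    // atomicAdd stands in for the separate reduction pass (assumption for brevity).
    for (int f = 0; f < numFeatures; f++)
        atomicAdd(&csum[f * numBins + tmp], feat[f * numPixels + p]);
    atomicAdd(&ccount[tmp], 1);
    atomicAdd(&averageDist[tmp], sqrtf(minDist));
}
```

A launch like `assignClusters<<<dim3((numPixels + 255) / 256, numImages), 256>>>(...)` covers all images in one kernel call.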
If the `if (ok)` branch is not taken frequently enough, warp divergence occurs (see this). To avoid it, you can assign not one thread but one warp to each pixel, and divide the remaining computation among the threads within that warp.
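A sketch of the warp-per-pixel idea (names are hypothetical, and it assumes CUDA 9+ for `__shfl_down_sync`): each warp handles one pixel, the 32 lanes split the numBins loop among themselves, and a warp shuffle reduction finds the warp-wide minimum distance and its bin. Only the first phase (nearest-bin search) is shown; lane 0 writes the per-pixel result for a later accumulation step:

```cuda
__global__ void assignClustersWarp(const float *feat, const float *clusterPoint,
                                   float *minDistOut, int *binOut,
                                   int numFeatures, int numBins, int numPixels)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per pixel
    int lane   = threadIdx.x % 32;
    if (warpId >= numPixels) return;
    int p = warpId;

    // Each lane evaluates a strided subset of the bins.
    float best = 1e9f;
    int bestBin = -1;
    for (int i = lane; i < numBins; i += 32) {
        float dist = 0.0f;
        for (int f = 0; f < numFeatures; f++) {
            float d = feat[f * numPixels + p] - clusterPoint[f * numBins + i];
            dist += d * d;
        }
        if (dist < best) { best = dist; bestBin = i; }
    }

    // Warp-wide argmin via shuffle reduction: lane 0 ends up with the minimum.
    for (int offset = 16; offset > 0; offset /= 2) {
        float otherDist = __shfl_down_sync(0xffffffff, best, offset);
        int   otherBin  = __shfl_down_sync(0xffffffff, bestBin, offset);
        if (otherDist < best) { best = otherDist; bestBin = otherBin; }
    }
    if (lane == 0) { minDistOut[p] = best; binOut[p] = bestBin; }
}
```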
Upvotes: 1
Reputation: 544
Yes, you can use CUDA for summations. But the number of elements should be large enough that the time spent on the GPU summation (including data transfers) is less than the time of the same summation on the CPU. This may help you
Upvotes: 0