user1280671

Reputation: 69

parallel reduction technique

I have this piece of C++ code and I want to port it to CUDA.

for (int im = 0; im < numImages; im++)
{
    for (int p = 0; p < xsize*ysize; p++)
    {
        // Skip pixels whose feature vector is all zeros.
        bool ok = false;
        for (int f = 0; f < numFeatures; f++)
        {
            if (feature[im][f][p] != 0)
            {
                ok = true;
                break;
            }
        }
        if (ok)
        {
            // Find the nearest cluster (squared Euclidean distance).
            float minDist = 1e9f;
            int tmp = 0;
            for (int i = 0; i < numBins; i++)
            {
                float dist = 0.0f;
                for (int f = 0; f < numFeatures; f++)
                {
                    float d = (float)(feature[im][f][p] - clusterPoint[f][i]);
                    dist += d * d;
                }

                if (dist < minDist)
                {
                    minDist = dist;
                    tmp = i;
                }
            } // end for i

            // Accumulate per-cluster statistics.
            for (int f = 0; f < numFeatures; f++)
                csum[f][tmp] += feature[im][f][p];

            ccount[tmp]++;

            averageDist[tmp] += sqrt(minDist);

        } // end if (ok)
    } // end for p
} // end for im

I want to calculate csum, ccount and averageDist on the GPU. csum and averageDist are arrays of floats; ccount is an array of integers.

Is this a parallel reduction problem?

Upvotes: 0

Views: 260

Answers (2)

stuhlo

Reputation: 1507

I didn't fully understand what your code should do, and I don't know the approximate values of numBins and numFeatures either. Nevertheless, I would parallelize this loop: for (p = 0; p < xsize*ysize; p++), so that each thread computes its own values and stores them in a global array. Having these arrays of features and distances, you can then compute csum, ccount and averageDist using a standard parallel reduction.

The main loop over images, for (int im = 0; im < numImages; im++), can be handled either by launching the kernel repeatedly (once per image) or by parallelizing it together with the loop over pixels.

If the if(ok) condition is not satisfied frequently enough, warp divergence occurs (see this). To avoid it, you can assign not one thread but one warp to each pixel and divide the remaining computations among the threads within that warp.
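A minimal sketch of the one-thread-per-pixel kernel described above, using atomicAdd to accumulate csum, ccount and averageDist directly in global memory (a simpler alternative to a separate reduction pass; it works well when numBins is small relative to the pixel count). The flattened array layouts and the kernel name are assumptions, not taken from the question:

    // Assumed flattened layouts (row-major):
    //   feature:      [numImages][numFeatures][numPixels]
    //   clusterPoint: [numFeatures][numBins]
    //   csum:         [numFeatures][numBins]
    __global__ void assignPixels(const float *feature, const float *clusterPoint,
                                 float *csum, int *ccount, float *averageDist,
                                 int im, int numPixels, int numFeatures, int numBins)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numPixels) return;

        // Skip pixels whose feature vector is all zeros.
        bool ok = false;
        for (int f = 0; f < numFeatures && !ok; f++)
            if (feature[(im * numFeatures + f) * numPixels + p] != 0.0f)
                ok = true;
        if (!ok) return;

        // Find the nearest bin by squared Euclidean distance.
        float minDist = 1e9f;
        int best = 0;
        for (int i = 0; i < numBins; i++) {
            float dist = 0.0f;
            for (int f = 0; f < numFeatures; f++) {
                float d = feature[(im * numFeatures + f) * numPixels + p]
                        - clusterPoint[f * numBins + i];
                dist += d * d;
            }
            if (dist < minDist) { minDist = dist; best = i; }
        }

        // Atomically accumulate the per-bin statistics in global memory.
        for (int f = 0; f < numFeatures; f++)
            atomicAdd(&csum[f * numBins + best],
                      feature[(im * numFeatures + f) * numPixels + p]);
        atomicAdd(&ccount[best], 1);
        atomicAdd(&averageDist[best], sqrtf(minDist));
    }

You would launch it once per image, e.g. assignPixels<<<(numPixels + 255) / 256, 256>>>(...). If atomic contention on the few bins becomes a bottleneck, the standard refinement is to accumulate into shared-memory copies per block first and merge them with one atomic per block at the end.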

Upvotes: 1

Fr34K

Reputation: 544

Yes, you can use CUDA for summations. But the number of elements should be large enough that the summation takes less time on the GPU than on the CPU. This may help you

Upvotes: 0
