Illuminati0x5B
Illuminati0x5B

Reputation: 602

Optimizing global memory load in CUDA

My task : I have two matrices : A - (18 x 4194304) ; B - (18 x 1024).

I have to take each 18-length vector from A and compute distance with each 18-length vector from B and find minimum distance and index.

My code :

__device__
void GetMin(float &dist, int &idx)
{
    float dist2;
    int idx2;
    dist2 = __shfl_down_sync(0xFFFFFFFF, dist, 16, 32);
    idx2 = __shfl_down_sync(0xFFFFFFFF, idx, 16);
    if (dist > dist2)
    {
        dist = dist2;
        idx = idx2;
    }

    dist2 = __shfl_down_sync(0xFFFFFFFF, dist, 8, 32);
    idx2 = __shfl_down_sync(0xFFFFFFFF, idx, 8);
    if (dist > dist2)
    {
        dist = dist2;
        idx = idx2;
    }

    dist2 = __shfl_down_sync(0xFFFFFFFF, dist, 4, 32);
    idx2 = __shfl_down_sync(0xFFFFFFFF, idx, 4);
    if (dist > dist2)
    {
        dist = dist2;
        idx = idx2;
    }

    dist2 = __shfl_down_sync(0xFFFFFFFF, dist, 2, 32);
    idx2 = __shfl_down_sync(0xFFFFFFFF, idx, 2);
    if (dist > dist2)
    {
        dist = dist2;
        idx = idx2;
    }

    dist2 = __shfl_down_sync(0xFFFFFFFF, dist, 1, 32);
    idx2 = __shfl_down_sync(0xFFFFFFFF, idx, 1);
    if (dist > dist2)
    {
        dist = dist2;
        idx = idx2;
    }
}

__global__
void CalcMinDist_kernel(const float *A, const float *B, float *output, const int nNumPixels, int nNumImages)
{
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int ty = threadIdx.y;

    int lane_id = tx % 32;

    float dist = 0;
    int idx = 0;

    float fMin = 99999999;
    int nMinIdx = -1;

    for(int i = lane_id; i < 1024; i += 32)
    {
        dist = 0;
        for(int  j = 0; j < nNumImages; ++j)
        {
            int img_idx = blockIdx.x * ty + j * nNumPixels;
            dist += (A[img_idx] - B[i * nNumImages + j]) * 
                    (A[img_idx] - B[i * nNumImages + j]);
        }
        idx = i;
        GetMin(dist, idx);

        if(threadIdx.x == 0)
        {
            if(fMin > dist)
            {
                fMin = dist;
                nMinIdx = idx;
            }
        }
    }

    if(threadIdx.x == 0)
    {
        output[blockIdx.x * ty] = nMinIdx;
    }
}

Looking at the profiler, I'm memory bound, and do have ~90% occupancy. Is there any way to speed up this operation?

Let me know if I need to provide any other information.

Upvotes: 2

Views: 96

Answers (1)

einpoklum
einpoklum

Reputation: 131544

Actually, I would look at the algorithm first. This is a geometric problem - treat it as such.

You should represent the B data using a different data structure, e.g. by clustering or building a partition structure (e.g. k-d tree). That will let you avoid actually computing the distance from most B elements. (You could also consider a project onto fewer dimensions, but the benefit of this may be more elusive.)


With respect to the access pattern - you would probably benefit from having consecutive threads working on consecutive elements of the 18-element-long vectors, rather than having threads work on complete 18-element-long vectors individually. That would better fit the memory layout - right now, a warp read is of many elements which are at distance 18 from each other. If I understand the code correctly anyway.

(I also think the GetMin() could avoid some of the index swaps, but that's not significant since you only perform very few of those.)

Upvotes: 1

Related Questions