Thiago Conrado

Reputation: 863

__threadfence_block() and volatile + shared memory to fight registers

Question 1:

Can threadfence and volatile help the compiler to flush the data and release the registers?

Question 2:

If the shared memory is used only as a thread cache (no data is exchanged among threads through SMEM), is it safe to rely on the execution order? I mean, if one instruction changes SMEM at a specific address, and some other instruction later in the code reads it in the very same thread, without any global call, is it necessary to worry about fences/synchronization?

Background

After some time spent in a frustrating attempt to mitigate register spills by using shared memory in a kernel full of nested loops, it came to my attention that the register count was not changing at all. Taking a look at the ptxas output, I noticed that this happens because the compiler "delays" instructions in a way that the registers are never freed, generating spills.

Using the volatile keyword in the SMEM declaration released some registers, and a __threadfence_block() in one of the hottest loops gave the same result, but with only a very small performance gain (about 5%).
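For illustration, a minimal sketch of the two patterns described above (not the actual kernel; the kernel name, block size, and data are placeholders):

__global__ void sketch(const float* in, float* out, int n)
{
    // Pattern 1: volatile on the SMEM declaration forces real stores/loads,
    // so the compiler cannot keep the value alive in a register instead.
    volatile __shared__ float cache[256]; // 256 = assumed block size

    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;

    cache[threadIdx.x] = in[tid];

    // Pattern 2: a block-level fence at the hot spot also discourages the
    // compiler from deferring the store and holding on to the register.
    __threadfence_block();

    out[tid] = cache[threadIdx.x] * 2.0f;
}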

Kernel Code:

struct __align__(16) ushort8
{
    unsigned short w, x, y, z, k, l, m, n;
};

typedef struct ushort8 ushort8;


__global__ void altMax(const unsigned short nloops, const unsigned short clipper,
    const unsigned short p, const unsigned int npart, const unsigned int stride,
    unsigned short*  Partbondaries,
    ushort8* tpMaxPart, CUdeviceptr* indMax, unsigned long long int* ops) {
    const unsigned short cWarpSize(def_cWarpSize);
    // this variable should help to reduce the register pressure
    __shared__ float fel[6][THREADS_MAX];
    const int tid(blockDim.x * blockIdx.x + threadIdx.x);
    const unsigned int lId(threadIdx.x & 0x1f);
    if (tid > npart - 1) return;
    const unsigned short rl(Partbondaries[tid] + 1 - def_off);
    size_t l_ops(0);

    ushort8 el;
    int kPos;
    float cbMax, ftemp, pb0(0), tl6, tl7, tl8;// , tl[loff + 1];
    // alternative pattern midbody [cpu seek]
    for (int i = 0; i < nloops - 1; i++) {
        tex3D(&ftemp, ssm3D, Partbondaries[(i)* stride + tid] - 1,
            Partbondaries[(i + 1) * stride + tid] - 1, 0);
        pb0 += ftemp;
    }
    // alternative pattern tail [cpu seek]
    tex3D(&ftemp, ssm3D, Partbondaries[(nloops - 1)* stride + tid] - 1, p - 1, 0);
    pb0 += ftemp;
    // alternative pattern head [gpu seek]

    cbMax = idMax(indMax);
    ftemp = 0;
    kPos = 0;
    for (el.w = 1; el.w < rl + 0; el.w++) {
        if (kPos > 0) tex3D(&ftemp, ssm3D, 0, el.w - 1, 0);
        fel[0][threadIdx.x] = ftemp;
        for (el.x = el.w + 1; el.x < rl + 1; el.x++) {
            if (kPos > 1) tex3D(&ftemp, ssm3D, el.w, el.x - 1, 0);
            ftemp += fel[0][threadIdx.x];
            fel[1][threadIdx.x] = ftemp;
            for (el.y = el.x + 1; el.y < rl + 2; el.y++) {
                if (kPos > 2) tex3D(&ftemp, ssm3D, el.x, el.y - 1, 0);
                ftemp += fel[1][threadIdx.x];
                fel[2][threadIdx.x] = ftemp;
                for (el.z = el.y + 1; el.z < rl + 3; el.z++) {
                    if (kPos > 3) tex3D(&ftemp, ssm3D, el.y, el.z - 1, 0);
                    ftemp += fel[2][threadIdx.x];
                    fel[3][threadIdx.x] = ftemp;
                    for (el.k = el.z + 1; el.k < rl + 4; el.k++) {
                        if (kPos > 4) tex3D(&ftemp, ssm3D, el.z, el.k - 1, 0);
                        ftemp += fel[3][threadIdx.x];
                        fel[4][threadIdx.x] = ftemp;
                        for (el.l = el.k + 1; el.l < rl + 5; el.l++) {
                            if (kPos > 5) tex3D(&ftemp, ssm3D, el.k, el.l - 1, 0);
                            ftemp += fel[4][threadIdx.x];
                            fel[5][threadIdx.x] = ftemp;
                            __threadfence_block();
                            for (el.m = el.l + 1; el.m < rl + 6; el.m++) {
                                if (kPos > 6) tex3D(&ftemp, ssm3D, el.l, el.m - 1, 0);
                                tl6 = fel[5][threadIdx.x] + ftemp;
                                tl6 += pb0;
                                ftemp = 0;
                                for (el.n = el.m + 1; el.n < rl + 7; el.n++) {
                                    tex3D(&tl7, ssm3D, el.m, el.n - 1, 0);
                                    // test replacing this with a constant
                                    tex3D(&tl8, ssm3D, el.n, rl - 1, 0); // needs to be double-checked
                                    tl8 += tl7;
                                    l_ops++;
                                    if (tl8 > ftemp) {
                                        ftemp = tl8;
                                        kPos = el.n;
                                    }
                                }
                                ftemp += tl6;
                                if (ftemp > cbMax) {
                                    el.n = kPos;
                                    cbMax = ftemp;
                                    tpMaxPart[tid] = el;
                                }
                            }
                            kPos = 6;
                        }
                        kPos = 5;
                    }
                    kPos = 4;
                }
                kPos = 3;
            }
            kPos = 2;
        }
        kPos = 1;
    }
    // warp lvl reduction
    unsigned short maxtd, ttd;
    maxtd = lId;
    #pragma unroll
    for (int i = 1; cWarpSize > i; i *= 2) {
        pb0 = __shfl_down_sync(UINT32_MAX, cbMax, i, cWarpSize);
        ttd = __shfl_down_sync(UINT32_MAX, maxtd, i, cWarpSize);
        l_ops += __shfl_xor_sync(UINT32_MAX, l_ops, i, cWarpSize);
        if (pb0 > cbMax) {
            cbMax = pb0;
            maxtd = ttd;
        }
    }
    maxtd = __shfl_sync(UINT32_MAX, maxtd, 0, cWarpSize);

    // need to check whether all the values are really synchronized
    if (lId == maxtd) {
        atomicAdd(ops, l_ops);
        idMax(indMax, cbMax, tid);
    }

}

Upvotes: 0

Views: 422

Answers (1)

Robert Crovella

Reputation: 152173

Can threadfence and volatile help the compiler to flush the data and release the registers?

Probably in some cases. You seem to be already suggesting in your question that you have confirmed that this is the case. I would generally consider this not a very productive form of optimization (fighting against the compiler) but that's just an opinion or a personal preference. There's not enough here to actually pose an experiment or provide a concrete answer.

"releasing" registers in this fashion is just exchanging register use for some form of data load/store traffic. That is normally not a win, and the compiler generally tries to avoid that. You may have found a case where you can do slightly better. This kind of compiler optimization process can be fairly complex, and the current state of the art does not guarantee optimality. It only tries to achieve that in a reasonable amount of computation time. If you think you have found an egregious counter-example, then you may want to file a bug at developer.nvidia.com, with a full compilable code necessary to witness the issue, along with both cases identified for comparison. Of course you're welcome to file a bug under any circumstances, but I'm not sure a 5% observation will garner much attention.

If the shared memory is used only as a thread cache (no data is exchanged among threads through SMEM), is it safe to rely on the execution order? I mean, if one instruction changes SMEM at a specific address, and some other instruction later in the code reads it in the very same thread, without any global call, is it necessary to worry about fences/synchronization?

It is not necessary to worry about fences or synchronization if the shared memory usage is restricted to a single thread (i.e., no use of shared memory to share data between threads). In that case, the single-threaded C/C++ programming model applies, and you can be confident that if the thread saves a value to shared memory and then loads that value later, it will get the correct value.
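A minimal sketch of that single-thread case (block size is assumed to be 256 and to match the launch configuration):

__global__ void perThreadScratch(const float* in, float* out)
{
    __shared__ float scratch[256]; // one private slot per thread

    const int tid = blockDim.x * blockIdx.x + threadIdx.x;

    scratch[threadIdx.x] = in[tid] * 2.0f;  // write this thread's own slot
    float v = scratch[threadIdx.x];         // later read by the same thread:
                                            // ordinary program order applies, no fence needed
    out[tid] = v;

    // A barrier is only needed if another thread's slot is read, e.g.:
    // __syncthreads();
    // float neighbor = scratch[threadIdx.x ^ 1];
}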

Upvotes: 2
