How to optimize this OpenCL kernel?

I'm working on a project and I've got some problems with this OpenCL kernel :-(

__kernel void gemm_fast_5(
    __global double *ar, __global double *br, __global double *cr,
    __global double *pr, __global double *ur,

    unsigned long c, unsigned long c2,
    unsigned long c3, unsigned long c4,
    unsigned long c5, unsigned long m,
    unsigned char com
){
    unsigned long i = get_global_id(0);
    unsigned long j = get_global_id(1);

    unsigned long x = get_local_id(0);
    unsigned long y = get_local_id(1);

    unsigned long cur = i*c3 + j, rl, rl2, rl3;

    #if ks == 1 || ks == 2 || ks == 3 || ks == 4
    unsigned long rl4;
    #endif


    #if ks == 2
    rl = (i << 1)*c;
    #elif ks == 3
    rl = ((i << 1) + 1)*c;
    #else
    rl = i*c;
    #endif

    __local double ut, pt;

    if (x == 0) pt = pr[i*c4 + ks];
    if (y == 0) ut = ur[j*c5 + ks];

    double aa = 0.0;

    double bb, cc;
    double dd, ee;

    for (unsigned long k=0; k<m; k++){
        #if ks == 1 || ks == 4
        rl3 = (k << 1) + 1; rl4 = (k << 2) + 3;

        bb = ar[rl + rl3 - 1]; cc = ar[rl + rl3];
        dd = br[rl2 + rl4 - 1]; ee = br[rl2 + rl4 - 3];
        #elif ks == 2 || ks == 3
        rl3 = (k << 2) + 3; rl4 = (k << 1) + 1;

        bb = ar[rl + rl3 - 3]; cc = ar[rl + rl3 - 2];
        dd = br[rl2 + rl4]; ee = br[rl2 + rl4 - 1];
        #else
        rl3 = (k << 1) + 1;

        bb = ar[rl + rl3 - 1]; cc = ar[rl + rl3];
        dd = br[rl2 + rl3]; ee = br[rl2 + rl3 - 1];
        #endif

        aa += (bb + dd)*(cc + ee);
    }
    cr[cur] = aa - pt - ut;
}

While working, I noticed that if I delete the last line, the kernel takes 6 times less to run even changing the last line with cr[cur] = 5.0 - pt - ut; for example.

Shouldn't it take the same, or something similar at least? Even looking for an answer, taking advantage of the fact that I have CPU and GPU, I have tried in several runtime (PoCL and opencl-amd) and the same thing happens :-/

I would be very grateful if someone would give me a hand in making me understand why this happens. I don't understand :"v

Upvotes: 1

Views: 122

Answers (1)

Marco Bonelli
Marco Bonelli

Reputation: 69276

All the operations inside the loop do not have side effects, you only read from those __global pointers, and you calculate some temporary values that in the end get accumulated into aa through that final aa += .... In other words, the sole purpose of that loop is to calculate the value of aa.

Therefore, if you remove aa from the last line (outside the loop), all the operations inside the loop are completely useless, and you end up with a loop that does nothing except reading some values and updating local variables that will get discarded at function return. Compiling the above code with optimizations enabled (which I assume you are doing, otherwise your question wouldn't make much sense), the compiler is very likely to just get rid of the entire loop. Hence, the code without that final aa runs a lot faster.

Here's a GCC example (adapted removing CUDA annotations), where you can see that even the lowest level of optimization (-O1) removes the entire body of the loop, leaving only comparisons and the incrementing of i. With -O2, the whole loop is removed.

Upvotes: 1

Related Questions