Wesley Ranger
Wesley Ranger

Reputation: 780

Can anyone modify the following code to avoid the usage of "if"?

I am working on cuda currently and I got stuck on the code below. The code was originally written in matlab and I am trying to re-write it using cuda:

Pv = 0; Nv =0;
[LOOP]
v1 = l(li);
v2 = l(li+1);
if((v1>0) && (v2>0))
    Pv = Pv + 1;
elseif((v1<0) && (v2<0))
    Nv = Nv +1;
elseif((v1>0) && (v2<0))
    r = v1/(v1-v2);
    Pv = Pv + r;
    Nv = Nv + 1 - r;
elseif((v1<0) && (v2>0))
    r = v2/(v2-v1);
    Pv = Pv + r;
    Nv = Nv + 1 - r;
end
[LOOP END]

But, in cuda architecture, "if" expression is sometimes expensive, and I believe there is some way to avoid the usage of it, although I cannot figure it out now.

The main purpose of the code is to calculate the ratio of positive interval of negative interval and add them up respectively. In most of the situation , v1 and v2 have the same sign, but once they have different sign, I have to use a bunch of "if" or even perhaps "abs()" to handle the situation.

So, can anyone help me to re-write the code using C while using as few "if " as possible?

Upvotes: 1

Views: 208

Answers (3)

njuffa
njuffa

Reputation: 26085

Since you stated that the case of non-zero operands of like sign is the most common case, it would be best to handle the rare cases 3 and 4 using a branch, especially since they require an expensive division operation which we would not want to execute in the common (fast) path. Considering that min() and max() as well as the ternary operator are directly supported by hardware in GPUs, and that GPUs have extensive support for predication, I would suggest the slight re-write shown below. I have tested this using random combinations of two elements drawn from the set {-0.0f, -3.0f, -5.0f, 0.0f, 3.0f, 5.0f} to make sure cases with v1==0 or v2==0 are covered.

float Nv, Pv, v1, v2;
float r;
if ((v1 > 0) && (v2 > 0)) {
    Pv = Pv + 1;
} 
if ((v1 < 0) && (v2 < 0)) {
    Nv = Nv + 1;
}
if (((v1 > 0) && (v2 < 0)) || ((v1 < 0) && (v2 > 0))) { // rare case
    float s = min (v1, v2);
    float t = max (v1, v2);
    r = t / (t - s);
    Pv = Pv + r;
    Nv = Nv + 1 - r;
}

Compiled for sm_20, the code generated by the compiler is branchless, except for the rare slow path:

   /*0008*/         FSETP.LT.AND P2, PT, RZ, c[0x0][0x2c], PT;
   /*0010*/         FSETP.GT.AND P3, PT, RZ, c[0x0][0x28], PT;
   /*0018*/         FSETP.LT.AND P1, PT, RZ, c[0x0][0x28], PT;
   /*0020*/         MOV32I R4, 0x3f800000;
   /*0028*/         FSETP.GT.AND P4, PT, RZ, c[0x0][0x2c], PT;
   /*0030*/         PSETP.AND.AND P5, PT, P3, P2, PT;
   /*0038*/         PSETP.AND.AND P0, PT, P1, P2, PT;
   /*0040*/         FADD R0, R4, c[0x0][0x24];
   /*0048*/         PSETP.AND.AND P2, PT, P3, P4, PT;
   /*0050*/         FADD R4, R4, c[0x0][0x20];
   /*0058*/         PSETP.AND.OR P1, PT, P1, P4, P5;
   /*0060*/         MOV R2, c[0x0][0x30];
   /*0068*/         MOV R3, c[0x0][0x34];
   /*0070*/         SEL R0, R0, c[0x0][0x24], P0;
   /*0078*/         SEL R6, R4, c[0x0][0x20], P2;
   /*0080*/    @!P1 BRA 0xc8;
   /*0088*/         MOV R4, c[0x0][0x28];
   /*0090*/         FMNMX R5, R4, c[0x0][0x2c], PT;
   /*0098*/         FMNMX R4, R4, c[0x0][0x2c], !PT;
   /*00a0*/         FADD R5, R4, -R5;
   /*00a8*/         CAL 0xe0;                 // division
   /*00b0*/         FADD R5, R6, 1;
   /*00b8*/         FADD R0, R0, R4;
   /*00c0*/         FADD R6, R5, -R4;
   /*00c8*/         FADD R0, R0, R6;

Upvotes: 3

alk
alk

Reputation: 70893

Adding to ablish's answer:

This

r = (v1 == v2) ?0 :((v1 > v2) * v1 - (v2 > v1) * v2) / (v1 - v2) * (v1 * v2 < 0);

can be replaced by this

double ra[2] = {((v1 > v2) * v1 - (v2 > v1) * v2) / (v1 - v2) * (v1 * v2 < 0), 0.};
r = ra[!!(v1 == v2)];

removing the last ternary operator.


Update:

The !! is useless so the 2nd line above should just be:

r = ra[v1 == v2];

As pointed out by abligh in a comment to this answer we might get a division by zero during initialisation of ra, so we only want to do the appropriate calculation:

double r0(double v1, double v2)
{
  return 0;
}

double r1(double v1, double v2)
{
  return ((v1 > v2) * v1 - (v2 > v1) * v2) / (v1 - v2) * (v1 * v2 < 0);
}

double (*rf[2])(double v1, double v2) = {r0, r1}; 

...

r = rf[fabs(v1 - v2) < EPSILON](v1, v2); /* With EPSILON like 0.00001 or what ever accuracy is needed. */

Upvotes: 2

abligh
abligh

Reputation: 25119

Utilising the fact that each of the expressions you use gives a 0 or 1 binary expression, I can get it down to one ternary operator:

// r is 0 if not case 3 or 4
r = (v1==v2)?0:((v1>v2)*v1-(v2>v1)*v2)/(v1-v2) * (v1*v2<0);
Pv += r + // cases 3 & 4
     (v1>0) && (v2>0); // case 1
Nv += r + // cases 3 & 4
     (v1<=0) || (v2<=0); // cases 2, 3 and 4

(Edit: further optimisation untested!)

However, this smells a lot like premature optimisation. Is it really the number of if statements that's going to cause the issue?

Upvotes: 4

Related Questions