Alberdi
Alberdi

Reputation: 43

My OpenCL code changes the output based on a seemingly noop

I'm running the same OpenCL kernel code on an Intel CPU and on a NVIDIA GPU and the results are wrong on the first but right on the latter; the strange thing is that if I do some seemingly irrelevant changes the output works as expected in both cases.

The goal of the function is to calculate the matrix multiplication between A (triangular) and B (regular), where the position of A in the operation is determined by the value of the variable left. The bug only appears when left is true and when the for loop iterates at least twice.

Here is a fragment of the code omitting some bits that shouldn't affect for the sake of clarity.

__kernel void blas_strmm(int left, int upper, int nota, int unit, int row, int dim, int m, int n,
                         float alpha, __global const float *a, __global const float *b, __global float *c) {

  /* [...] */
  int ty = get_local_id(1);
  int y = ty + BLOCK_SIZE * get_group_id(1);
  int by = y;
  __local float Bs[BLOCK_SIZE][BLOCK_SIZE];
  /* [...] */

  for(int i=start; i<end; i+=BLOCK_SIZE) {
    if(left) {
      ay = i+ty;
      bx = i+tx;
    }   
    else {
      ax = i+tx;
      by = i+ty;
    }   

    barrier(CLK_LOCAL_MEM_FENCE);
    /* [...] (Load As) */
    if(bx >= m || by >= n)
      Bs[tx][ty] = 0;
    else
      Bs[tx][ty] = b[bx*n+by];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* [...] (Calculate Csub) */
  }

  if(y < n && x < (left ? row : m)) // In bounds
    c[x*n+y] = alpha*Csub;
}

Now it gets weird.

As you can see, by always equals y if left is true. I checked (with some printfs, mind you) and left is always true, and the code on the else branch inside the loop is never executed. Nevertheless, if I remove or comment out the by = i+ty line there, the code works. Why? I don't know yet, but I though it might be something related to by not having the expected value assigned.

My train of thought took me to check if there was ever a discrepancy between by and y, as they should have the same value always; I added a line that checked if by != y but that comparison always returned false, as expected. So I went on and changed the appearance of by for y so the line

if(bx >= m || by >= n)

transformed into

if(bx >= m || y >= n)

and it worked again, even though I'm still using the variable by properly three lines below.

With an open mind I tried some other things and I got to the point that the code works if I add the following line inside the loop, as long as it is situated at any point after the initial if/else and before the if condition that I mentioned just before.

if(y >= n) left = 1;

The code inside (left = 1) can be substituted for anything (a printf, another useless assignation, etc.), but the condition is a bit more restrictive. Here are some examples that make the code output the correct values:

if(y >= n) left = 1;
if(y < n) left = 1;
if(y+1 < n+1) left = 1;
if(n > y) left = 1;

And some that don't work, note that m = n in the particular example that I'm testing:

if(y >= n+1) left = 1;
if(y > n) left = 1;
if(y >= m) left = 1;
/* etc. */

That's the point where I am now. I have added a line that shouldn't affect the program at all but it makes it work. This magic solution is not satisfactory to me and I would like to know what's happening inside my CPU and why.

Just to be sure I'm not forgetting anything, here is the full function code and a gist with example inputs and outputs.

Thank you very much.


Solution

Both users DarkZeros and sharpneli were right about their assumptions: the barriers inside the for loop weren't being hit the right amount of times. In particular, there was a bug involving the very first element of each local group that made it run one iteration less than the rest, provoking an undefined behaviour. It was painfully obvious to see in hindsight.

Thank you all for your answers and time.

Upvotes: 4

Views: 423

Answers (2)

sharpneli
sharpneli

Reputation: 1621

Have you checked that the get_local_size always returns the correct value?

You said "In short, the full length of the matrix is divided in local blocks of BLOCK_SIZE and run in parallel; ". Remember that OpenCL allows any concurrency only within a workgroup. So if you call enqueueNDrange with global size of [32,32] and local size of [16,16] it is possible that the first thread block runs from start to finish, then the second one, then third etc. You cannot synchronize between workgroups.

What are your EnqueueNDRange call(s)? Example of the calls required to get your example output would be heavily appreciated (mostly interested in the global and local size arguments).

(I'd ask this in a comment but I am a new user).

E (Had an answer, upon verification did not have it, still need more info): http://multicore.doc.ic.ac.uk/tools/GPUVerify/

By using that I got a complaint that a barrier could be reached by a nonuniform control flow.

It all depends on what values dim, nota and upper get. Could you provide some examples?

I did some testing. Assuming left = 1. nota != upper and dim = 32, row as 16 or 32 or whatnot, still worked and got the following result:

...
gid0: 2 gid1: 0 lid0: 14 lid1: 13 start:  0  end: 32
gid0: 2 gid1: 0 lid0: 14 lid1: 14 start:  0  end: 32
gid0: 2 gid1: 0 lid0: 14 lid1: 15 start:  0  end: 32
gid0: 2 gid1: 0 lid0: 15 lid1:  0 start:  0  end: 48
gid0: 2 gid1: 0 lid0: 15 lid1:  1 start:  0  end: 48
gid0: 2 gid1: 0 lid0: 15 lid1:  2 start:  0  end: 48
...

So if my assumptions about the variable values are even close to correct you have barrier divergence issue there. Some threads encounter a barrier which another threads never will. I'm surprised it did not deadlock.

Upvotes: 2

DarkZeros
DarkZeros

Reputation: 8410

The first thing I see it can terribly fail, is that you are using barriers inside a for loop.

If all the threads do not enter the same amount of times the for loop. Then the results are undefined completely. And you clearly state the problem only occurs if the for loop runs more than once.

Do you ensure this condition?

Upvotes: 1

Related Questions