DerJFK

Reputation: 321

Problem with summing entries using the elementwise kernel in cupy

In the first code example (kernel_conv), I programmed a simple convolution and it worked with the expected result [1,1,2,1,1].

Then I used the elementwise kernel to sum all entries of a vector. However, when I run the second example (kernel_sum), I get the result [3,0,0] but would expect [6,0,0].

What is the difference between these two examples? Why is the variable y updated correctly in the first example, while in the second it seems to be overwritten?

import numpy as np 
import cupy as cp 

kernel_conv = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    ''' int idx = i*2 + 1;
        for(size_t j=0;j<3;j++){
          y[idx - 1 + j] += x[j];
        }
    ''', 'conv')

x = cp.asarray(np.array([1,1,1]),dtype=np.float32)
y = cp.zeros((5,),dtype=np.float32)
z = kernel_conv(x,y,size=2)
print(z)

kernel_sum = cp.ElementwiseKernel(
  'raw float32 x', 'raw float32 y',
  ''' 
      y[0] += x[i];
  ''', 'summe')

x = cp.asarray(np.array([1, 2, 3]), dtype=np.float32)
y = cp.zeros((3,),dtype=np.float32)
z = kernel_sum(x,y,size=3)
print(z)

Upvotes: 0

Views: 842

Answers (1)

Seiya Tokui

Reputation: 341

The incorrect result of kernel_sum is due to a data race: three threads try to write to the same global-memory address (y[0]) at the same time. To avoid the data race, you can either 1) use atomicAdd or 2) use cupy.ReductionKernel for the reduction, as sketched below.
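
For illustration, a minimal sketch of both fixes (the kernel names summe_atomic and summe_reduce are arbitrary, and the ReductionKernel arguments follow the map / reduce / post-map / identity order from the CuPy documentation):

import numpy as np
import cupy as cp

# 1) atomicAdd serializes the conflicting updates to y[0].
kernel_sum_atomic = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    '''
        atomicAdd(&y[0], x[i]);
    ''', 'summe_atomic')

x = cp.asarray(np.array([1, 2, 3]), dtype=np.float32)
y = cp.zeros((3,), dtype=np.float32)
print(kernel_sum_atomic(x, y, size=3))  # expected [6. 0. 0.]

# 2) cupy.ReductionKernel lets CuPy organize the reduction itself.
kernel_sum_reduce = cp.ReductionKernel(
    'float32 x',    # input params
    'float32 y',    # output params
    'x',            # map: value fed into the reduction
    'a + b',        # reduce: how two partial results are combined
    'y = a',        # post-map: write the final result
    '0',            # identity element
    'summe_reduce')

print(kernel_sum_reduce(x))  # expected 6.0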

Actually, kernel_conv also has a data race: the first thread running y[2] += x[2] may conflict with the second thread running y[2] += x[0]. Because the first thread happened to lag slightly in the actual execution, the result was not affected, but this is a matter of timing and is not guaranteed in general*. To fix it, you can again use atomicAdd, or you can change how the computation is split across threads (e.g. launching 5 threads, each of which computes a distinct element of y, as sketched below).
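
A minimal sketch of that gather-style approach (the name conv_gather and the hard-coded sizes are just illustrative; each thread writes only its own y[i], so no two threads touch the same element):

import numpy as np
import cupy as cp

kernel_conv_gather = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    '''
        // Thread i owns y[i] and gathers every x[j] that the original
        // scatter kernel would have added into it (all m with 2*m + j == i).
        float acc = 0;
        for (int m = 0; m < 2; m++) {
            int j = i - 2 * m;
            if (j >= 0 && j < 3) {
                acc += x[j];
            }
        }
        y[i] = acc;
    ''', 'conv_gather')

x = cp.asarray(np.array([1, 1, 1]), dtype=np.float32)
y = cp.zeros((5,), dtype=np.float32)
print(kernel_conv_gather(x, y, size=5))  # expected [1. 1. 2. 1. 1.]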

* Indeed, in the above example of kernel_conv, I guess the correctness IS guaranteed when all the threads are running in the same warp, i.e., the number of threads is not larger than 32. This is because all the threads in the same warp run synchronously until they diverge due to control flow. If the number of threads is set to a larger value, a data race may happen at a warp boundary.

Upvotes: 1
