Reputation: 321
In the first code example (kernel_conv), I programmed a simple convolution, and it worked with the expected result [1, 1, 2, 1, 1].
Then I used an elementwise kernel to sum all entries of a vector. However, when I run the second example (kernel_sum), I get the result [3, 0, 0] but would expect [6, 0, 0].
What is the difference between these two examples? Why is the variable y updated in the first example, while in the second it seems to be overwritten?
import numpy as np
import cupy as cp

kernel_conv = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    '''
    int idx = i * 2 + 1;
    for (size_t j = 0; j < 3; j++) {
        y[idx - 1 + j] += x[j];
    }
    ''', 'conv')

x = cp.asarray(np.array([1, 1, 1]), dtype=np.float32)
y = cp.zeros((5,), dtype=np.float32)
z = kernel_conv(x, y, size=2)
print(z)

kernel_sum = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    '''
    y[0] += x[i];
    ''', 'summe')

x = cp.asarray(np.array([1, 2, 3]), dtype=np.float32)
y = cp.zeros((3,), dtype=np.float32)
z = kernel_sum(x, y, size=3)
print(z)
Upvotes: 0
Views: 842
Reputation: 341
The incorrect result of kernel_sum is due to a data race: here, 3 threads try to write to the same address in global memory (y[0]) at the same time. To avoid the data race, you should 1) use atomicAdd, or 2) use cupy.ReductionKernel, which is designed for reductions.
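For example, a minimal sketch of option 1) applied to your kernel_sum (the names kernel_sum_atomic and summe_atomic are mine, not from your code):

kernel_sum_atomic = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    '''
    // each thread adds its x[i] into y[0] atomically, so no update is lost
    atomicAdd(&y[0], x[i]);
    ''', 'summe_atomic')

x = cp.asarray(np.array([1, 2, 3]), dtype=np.float32)
y = cp.zeros((3,), dtype=np.float32)
kernel_sum_atomic(x, y, size=3)
print(y)  # expected: [6. 0. 0.]

And a sketch of option 2), where cupy.ReductionKernel combines the per-element values for you (again, the kernel name is my choice):

kernel_sum_red = cp.ReductionKernel(
    'float32 x',        # input params
    'float32 y',        # output params
    'x',                # map: value each element contributes
    'a + b',            # reduce: how two partial results are combined
    'y = a',            # post-reduction: store the final value
    '0',                # identity element of the reduction
    'summe_reduction')

x = cp.asarray(np.array([1, 2, 3]), dtype=np.float32)
print(kernel_sum_red(x))  # expected: 6.0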
Actually, kernel_conv also has a data race. The first thread running y[2] += x[2] may conflict with the second thread running y[2] += x[0]. In your run the first thread happened to lag slightly, so the result was not affected, but this is a matter of timing and is not guaranteed in general*. To correct it, you can again use atomicAdd, or you can change how the computation is split across threads (e.g. launch 5 threads, each of which computes a distinct element of y; see the sketch below the footnote).
* Indeed, in the above example of kernel_conv, I guess the correctness IS guaranteed when all the threads run in the same warp, i.e. when the number of threads is not larger than 32, because all threads in a warp execute synchronously until they diverge at a control-flow branch. If the number of threads is set to a larger value, a data race may happen at a warp boundary.
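For reference, a rough sketch of that second fix for kernel_conv: launch 5 threads, one per element of y, and let each thread gather the contributions it needs instead of scattering into shared locations (the name kernel_conv_gather is mine):

kernel_conv_gather = cp.ElementwiseKernel(
    'raw float32 x', 'raw float32 y',
    '''
    // thread i owns y[i] exclusively; it collects the terms that the
    // scatter version would have added from threads t = 0 and t = 1
    float acc = 0.0f;
    for (int t = 0; t < 2; t++) {
        int j = i - 2 * t;
        if (0 <= j && j < 3) {
            acc += x[j];
        }
    }
    y[i] = acc;
    ''', 'conv_gather')

x = cp.asarray(np.array([1, 1, 1]), dtype=np.float32)
y = cp.zeros((5,), dtype=np.float32)
kernel_conv_gather(x, y, size=5)
print(y)  # expected: [1. 1. 2. 1. 1.]

Since no two threads ever write to the same address, no atomics are needed and the result does not depend on thread timing.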
Upvotes: 1