Reputation: 26
This CUDA kernel computes the inner product of two vectors. Multiple threads compute the element-wise products concurrently, and then a single thread sums them to get the inner product. My question is: why is the result still correct without __syncthreads()?
__global__ void dot( int *a, int *b, int *c, int *dot ){
    int tid = threadIdx.x;
    int i;
    c[tid] = a[tid] * b[tid];
    //__syncthreads();
    //need synchronize??
    if(tid==0){
        for(i=0; i<N; i++){
            *dot += c[i];
        }
    }
}
Upvotes: 0
Views: 243
Reputation: 51
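In your for loop, thread 0 reads every element of array c, including the elements written by other threads. Nothing guarantees that those writes have completed by the time thread 0 starts summing, so the result is only reliable if you call __syncthreads() after the write to c and before the loop. Without the barrier there is a data race: the result may happen to be correct (for example, when the block is small enough to fit in a single warp, whose threads often execute in lockstep), but you cannot depend on that.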
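A minimal sketch of the synchronized version, in case it helps. It assumes a single block of N threads and picks N = 8 for illustration, since the question does not show how N is defined. The parameter name result and the host helper dot_on_gpu are hypothetical names introduced here, not part of the original code. Unlike the original, which accumulates with *dot += c[i] and so relies on *dot being zeroed beforehand, this sketch sums into a local variable and writes the total once.

```cuda
#include <cuda_runtime.h>

#define N 8  // assumed vector length; must not exceed the threads in one block

// Each thread writes one product, then thread 0 sums them after a barrier.
__global__ void dot(const int *a, const int *b, int *c, int *result) {
    int tid = threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] * b[tid];
    }
    // Barrier: every thread in the block reaches this point, and all writes
    // to c made before it are visible to the block afterwards. It sits
    // outside the branches so that no thread can skip it.
    __syncthreads();
    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            sum += c[i];
        }
        *result = sum;
    }
}

// Hypothetical host helper: copies the inputs to the device, launches one
// block of N threads, and returns the inner product. Error checks on the
// CUDA calls are omitted to keep the sketch short.
int dot_on_gpu(const int *h_a, const int *h_b) {
    int *d_a, *d_b, *d_c, *d_result;
    int h_result = 0;
    size_t bytes = N * sizeof(int);

    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMalloc((void **)&d_result, sizeof(int));

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dot<<<1, N>>>(d_a, d_b, d_c, d_result);

    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_result);
    return h_result;
}
```

Note that __syncthreads() only synchronizes threads within one block, so this pattern does not extend as-is to vectors that span several blocks. Having one thread do the whole sum is also serial; a shared-memory tree reduction is the usual faster alternative, but that is a separate change from the correctness fix here.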
Upvotes: 1