brucelin

Reputation: 26

Why is the result correct without __syncthreads()?

This CUDA kernel calculates the inner product of two vectors. It uses multiple threads to compute the element-wise products concurrently, and then a single thread sums them to get the inner product. My question is: why is the result correct without __syncthreads()?

__global__ void dot( int *a, int *b, int *c, int *dot ){
    int tid = threadIdx.x;
    int i;

    c[tid] = a[tid] * b[tid];
    
    //__syncthreads();
    //need synchronize?? 
    if(tid==0){
        for(i=0; i<N; i++){
            *dot += c[i];
        }
    }
}

Upvotes: 0

Views: 243

Answers (1)

Hyunwoo Kim

Reputation: 51

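In your for loop, thread 0 reads every element of array c, and those elements are written by other threads. Without a barrier this is a data race: nothing guarantees that the other threads have finished writing c before thread 0 starts summing. The result can still come out correct by luck, for example when the block is small enough that all of its threads happen to finish their writes before thread 0 reaches the loop, but that behavior is not guaranteed. To make the result reliably correct, call __syncthreads() after the write to c so that every product is complete before the summation begins.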
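
For reference, here is a minimal sketch of the kernel with the barrier in place. It assumes N is a compile-time constant (64 here, chosen only for illustration) and that the kernel is launched as a single block of N threads, which the question's indexing by threadIdx.x implies. The output parameter is renamed to result, the sum is accumulated in a local variable, and dotOnDevice is a hypothetical host helper added only to make the sketch complete. CUDA error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>

#define N 64  // assumed vector length; must not exceed the per-block thread limit

// One block of N threads: each thread writes one product, then thread 0 sums.
__global__ void dot(const int *a, const int *b, int *c, int *result) {
    int tid = threadIdx.x;

    c[tid] = a[tid] * b[tid];

    // Barrier: no thread in the block continues until every thread has
    // arrived here, and the writes to c[] made before the barrier are
    // visible to all threads in the block afterwards.
    __syncthreads();

    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            sum += c[i];
        }
        *result = sum;  // overwrite, so the output needs no prior zeroing
    }
}

// Hypothetical host helper: copies the inputs to the device, launches one
// block of N threads, and returns the inner product.
int dotOnDevice(const int *hostA, const int *hostB) {
    int *a, *b, *c, *result;
    int out = 0;

    cudaMalloc((void **)&a, N * sizeof(int));
    cudaMalloc((void **)&b, N * sizeof(int));
    cudaMalloc((void **)&c, N * sizeof(int));
    cudaMalloc((void **)&result, sizeof(int));

    cudaMemcpy(a, hostA, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(b, hostB, N * sizeof(int), cudaMemcpyHostToDevice);

    dot<<<1, N>>>(a, b, c, result);

    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&out, result, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    cudaFree(result);
    return out;
}
```

Note that __syncthreads() only synchronizes threads within one block, so this approach works only while the whole vector fits in a single block. With a[i] = i and b[i] = 2 for i from 0 to 63, dotOnDevice should return 4032.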

Upvotes: 1
