Why is the result correct without __syncthreads()?

Question

This CUDA kernel computes the inner product of two vectors. It uses multiple threads to compute the element-wise products concurrently, and then a single thread sums those products into the inner product. My question is: why is the result correct even without __syncthreads()?

__global__ void dot( int *a, int *b, int *c, int *dot ){
    int tid = threadIdx.x;
    int i;

    c[tid] = a[tid] * b[tid];
    
    //__syncthreads();
    // Is synchronization needed here?
    if(tid==0){
        for(i=0; i


Answers (1)
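
One likely explanation, offered as a sketch rather than a definitive answer because the launch configuration is not shown in the question: if the kernel is launched with a single block of at most 32 threads, all threads belong to one warp. On hardware that executes a warp in lockstep, every lane has already stored its product in c[] by the time thread 0 enters the summation loop, so the result happens to be correct. That behavior is not guaranteed. With more than one warp per block, or on GPUs with independent thread scheduling (Volta and later), thread 0 can read elements of c[] that have not been written yet, which is a data race. Restoring the barrier makes the kernel correct in every case. In the code below, the vector length N, the kernel name dot_synced, and the host helper gpu_dot are assumptions made for illustration; they do not come from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // hypothetical vector length; deliberately spans two warps

// Same structure as the kernel in the question, with the barrier restored.
__global__ void dot_synced(const int *a, const int *b, int *c, int *dot) {
    int tid = threadIdx.x;

    // Each thread computes one element-wise product.
    c[tid] = a[tid] * b[tid];

    // Block-wide barrier: all threads in the block reach this point, and
    // their earlier writes to c[] are visible to the whole block, before
    // any thread continues.
    __syncthreads();

    // A single thread sums the products serially.
    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            sum += c[i];
        }
        *dot = sum;
    }
}

// Hypothetical host helper: copies the inputs to the device, launches one
// block of N threads, and returns the inner product computed on the GPU.
int gpu_dot(const int *h_a, const int *h_b) {
    int *d_a, *d_b, *d_c, *d_dot;
    int h_dot = 0;
    size_t bytes = N * sizeof(int);

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMalloc(&d_dot, sizeof(int));

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dot_synced<<<1, N>>>(d_a, d_b, d_c, d_dot);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_dot, d_dot, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_dot);
    return h_dot;
}

int main() {
    int h_a[N], h_b[N], expected = 0;
    for (int i = 0; i < N; i++) {
        h_a[i] = i;
        h_b[i] = 2;
        expected += h_a[i] * h_b[i];  // CPU reference value
    }

    int result = gpu_dot(h_a, h_b);
    printf("gpu = %d, cpu = %d\n", result, expected);
    return result == expected ? 0 : 1;
}
```

The barrier only synchronizes threads within one block, so this approach relies on the whole vector fitting in a single block. Removing __syncthreads() from this sketch may still produce the right value on a given GPU, but only by accident of scheduling, not by any guarantee of the programming model.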
