CuriousPenguin

Reputation: 97

How to synchronize threads in CUDA without using atomics

I am learning CUDA programming from the online UDACITY course. A sample code was given in the second lesson which has basically two kernels. The first one, __global__ void increment_naive(int *g), simply adds 1 to an element of the array *g, which resides in global memory.

The entire code according to UDACITY is as follows:

#include <stdio.h>
#include "gputimer.h"

#define NUM_THREADS 1000000
#define ARRAY_SIZE  100

#define BLOCK_WIDTH 1000

void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i < size; i++)  { printf("%d ", array[i]); }
    printf("}\n");
}

__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 

    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;  
    g[i] = g[i] + 1;
}

__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 

    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;  
    atomicAdd(& g[i], 1);
}

int main(int argc,char **argv)
{   
    GpuTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);

    // declare and allocate host memory
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero out GPU memory
    int * d_array;
    cudaMalloc((void **) &d_array, ARRAY_BYTES);
    cudaMemset((void *) d_array, 0, ARRAY_BYTES); 

    // launch the kernel - comment out one of these
    timer.Start();


    increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    //increment_atomic<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.Stop();

    // copy back the array of sums from GPU and print
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);
    printf("Time elapsed = %g ms\n", timer.Elapsed());

    // free GPU memory allocation and exit
    cudaFree(d_array);
    return 0;
}

According to the program, a million threads in 1000 blocks are writing into 100 array elements. Thus each array element should end up with the value 10000 (1,000,000 / 100).

The first kernel fails to produce the required output because the threads access the array without any synchronization, which produces undesirable results. As I understand it, this can be solved with barriers such as __syncthreads() or with atomic operations.
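A quick way to quantify the lost updates (my own check, not part of the UDACITY code, assuming it is placed right after the cudaMemcpy in main) is to sum the array on the host; if no increments were lost, the elements must add up to NUM_THREADS:

// host-side sanity check: every one of the NUM_THREADS increments
// should have landed somewhere in the array
long long total = 0;
for (int i = 0; i < ARRAY_SIZE; i++) { total += h_array[i]; }
printf("sum = %lld (expected %d)\n", total, NUM_THREADS);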

The second kernel works fine and produces the correct output, which is as follows:

1000000 total threads in 1000 blocks writing into 100 array elements
{ 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 }
Time elapsed = 0.367648 ms

As previously stated, the first kernel produces a wrong output every time.

1000000 total threads in 1000 blocks writing into 100 array elements
{ 75 75 75 75 78 78 78 78 73 73 73 73 82 82 82 82 85 85 85 85 92 92 92 92 104 104 104 104 107 107 107 107 89 89 89 89 88 88 88 88 95 95 95 95 103 103 103 103 106 106 106 106 107 107 107 107 105 105 105 105 113 113 113 113 96 96 96 96 95 95 95 95 95 95 95 95 100 100 100 100 98 98 98 98 104 104 104 104 110 110 110 110 126 126 126 126 90 90 90 90 }
Time elapsed = 0.23392 ms

I am trying to fix the first kernel by placing barriers at different stages of the computation, but I am failing to get the necessary output. My attempt at fixing the first kernel is as follows:

__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    __syncthreads();
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    //i = i % ARRAY_SIZE;
    int temp = i%ARRAY_SIZE;
    __syncthreads();
    i = temp;
    __syncthreads();
    //g[i] = g[i] + 1;
    int temp1 = g[i]+1;
    __syncthreads();
    g[i] = temp1;
    __syncthreads();

}

I would like someone to guide me through this, as the issue is bothering me a lot and hampering my confidence in progressing further.

Upvotes: 0

Views: 1524

Answers (1)

CygnusX1

Reputation: 21778

The __syncthreads() function ensures that all threads in a block are at the same spot in the code. Using it is not going to achieve what you want. Even worse: suppose that CUDA were a perfect parallel machine, with all threads working in lockstep. You would never need any __syncthreads at all, yet you would still get a different result. Consider the following pseudocode and the explanation of what is going on:

__perfect_parallel_machine__ void increment_naive(int *g)
{
    int idx = thisThreadIdx % ARRAY_SIZE;
    int local = g[idx];
                               //*all* threads load the initial value of g[idx]
                               //each thread holds a separate copy of the 'local' variable
                               //local=0 in each thread
    local = local + 1;
                               //each thread increments its own private copy of 'local'
                               //local=1 for all threads
    g[idx] = local;
                               //each thread stores the same value (1) into global array
                               //g = {1, 1, 1, 1, 1, ...., 1}
}

Since CUDA is not a perfect parallel machine, things happen out of order, and you end up getting higher values in your array. Putting in more synchronization barriers will only move you closer to the ideal {1, 1, ..., 1} result.

There are other barrier functions, such as __threadfence(). This one stalls the current thread (only the current one!) until a store to the global array is guaranteed to be visible to other threads. This is related to L1/L2 caching and has nothing to do with thread synchronization. It is common, for example, to use __threadfence in conjunction with atomics to flag that you have finished filling in some data, as in the sketch below.
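Here is a minimal sketch of that pattern (my own illustration, untested, similar in spirit to NVIDIA's threadFenceReduction sample): each block publishes a partial result, makes the store visible with __threadfence(), and then atomically bumps a counter. Whichever block bumps the counter last knows that every partial is already visible and combines them.

__device__ unsigned int blocksDone = 0;   // zero-initialized at module load;
                                          // reset (cudaMemcpyToSymbol) before re-launching

__global__ void sum_with_fence(const int *in, int *partial, int *result, int n)
{
    if (threadIdx.x == 0)                 // one thread per block, to keep the sketch short
    {
        int s = 0;
        for (int i = blockIdx.x; i < n; i += gridDim.x)
            s += in[i];
        partial[blockIdx.x] = s;          // publish this block's partial sum

        __threadfence();                  // make the store to partial[] visible device-wide

        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        if (ticket == gridDim.x - 1)      // this is the last block to finish
        {
            int total = 0;
            for (int b = 0; b < (int)gridDim.x; b++)
                total += partial[b];
            *result = total;              // all partials are guaranteed visible here
        }
    }
}

The fence orders the data store before the flag update, and the atomic guarantees that exactly one block sees itself as the last one.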

I think there must be some misunderstanding between you and your tutor. I would suggest talking to him to clarify it...

Upvotes: 2
