Reputation: 31

OpenCL and GPU global synchronization

Has anyone tried the gpu_sync functions described in the article "Inter-Block GPU Communication via Fast Barrier Synchronization"? All the codes described seems pretty simple and easy to implement but it keeps freezing up my GPU. I'm sure I'm doing something stupid but I can't see what. Can anyone help me?

The strategy I'm using is the one described in the section “GPU Lock-Free Synchronization” and here is the OpenCL source code I've implemented:

static void globalSync(uint iGoalValue,
                   volatile __global int *globalSyncFlagsIN,
                   volatile __global int *globalSyncFlagsOUT)
{
 const size_t iLocalThreadID  = get_local_id(0);
 const size_t iWorkGroupID    = get_group_id(0);
 const size_t iWorkGroupCount = get_num_groups(0);

 //Only the first thread on each SM is used for synchronization
 if (iLocalThreadID == 0)
 { globalSyncFlagsIN[iWorkGroupID] = iGoalValue; }

 if (iWorkGroupID == 0)
 {
  if (iLocalThreadID < iWorkGroupCount)
  {
   while (globalSyncFlagsIN[iLocalThreadID] != iGoalValue) {
    // Nothing to do here
   }
  }

  barrier(CLK_GLOBAL_MEM_FENCE);

  if (iLocalThreadID < iWorkGroupCount)
  { globalSyncFlagsOUT[iLocalThreadID] = iGoalValue; }
 }

 if (iLocalThreadID == 0)
 {
  while (globalSyncFlagsOUT[iWorkGroupID] != iGoalValue) {
   // Nothing to do here 
  }
 }

 barrier(CLK_GLOBAL_MEM_FENCE);
}

Thanks in advance.

Upvotes: 2

Answers (2)

Jungwook Park

Reputation: 1

Must be too late but just for the reference. Unfortunately this is not going to work because barrier() only works across the work items in the same workgroup. i.e., user can only specify the address_space but not memory_scope. The builtin has been renamed as work_group_barrier to avoid that confusion. (barrier() is still supported for the backward compatibility) https://registry.khronos.org/OpenCL/sdk/2.0/docs/man/xhtml/work_group_barrier.html atomic extensions might be helpful, such as atomic_inc on __global.

Upvotes: 0

Sergei Kulik

Reputation: 353

I haven't tried running the code, but the direct translation from CUDA to OpenCL of the code from the article mentioned above would be:

{  
    int tid_in_blk = get_local_id(0) * get_local_size(1)
        + get_local_id(1);
    int nBlockNum = get_num_groups(0) * get_num_groups(1);
    int bid = get_group_id(0) * get_num_groups(1) + get_group_id(1);


    if (tid_in_blk == 0) {
        Arrayin[bid] = goalVal;
    }

    if (bid == 1) {
        if (tid_in_blk < nBlockNum) {
            while (Arrayin[tid_in_blk] != goalVal){

            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (tid_in_blk < nBlockNum) {
            Arrayout[tid_in_blk] = goalVal;
        }
    }

    if (tid_in_blk == 0) {
        while (Arrayout[bid] != goalVal) {

        }
    }
}

Please note the difference in thread and group IDs and in using local memory barrier instead of global one.

Upvotes: 2

OpenCL and GPU global synchronization

Answers (2)

Related Questions