Chris

Reputation: 1405

Optimize Metal compute shader for image histogram

I have a Metal shader that computes an image histogram like this:

#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];

kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[buffer(0)]],
             uint2  grid [[thread_position_in_grid]]) {
  if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }

  half gray  = sourceTexture.read(grid).r;
  uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
  atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}

This works as expected but takes too long (>1 ms). I tried to optimise this by reducing the number of atomic operations, and came up with the following improved code. The idea is to compute a local histogram per threadgroup and then add it atomically into the global histogram buffer.

kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[buffer(0)]],
             uint2 t_pos_grid [[thread_position_in_grid]],
             uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
             uint2 t_pos_tg [[ thread_position_in_threadgroup]],
             uint  t_idx_tg [[ thread_index_in_threadgroup ]],
             uint2 t_per_tg [[ threads_per_threadgroup ]]
             )
{
  threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };

  if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }

  half gray  = sourceTexture.read(t_pos_grid).r;
  uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
  localhisto[grayvalue]++;

  // wait for all threads in threadgroup to finish
  threadgroup_barrier(mem_flags::mem_none);

  // copy the thread group result atomically into global histo buffer
  if(t_idx_tg == 0) {
    for(uint i=0;i<CHANNEL_SIZE;i++) {
      atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
    }
  }
}

There are two problems:

  1. The improved routine does not yield identical results compared to the first one, and I currently don't see why.
  2. The runtime didn't improve. In fact, it takes four times the runtime of the unoptimised version. According to the debugger, the for loop is the problem. I don't understand this, since the number of atomic operations is reduced by three orders of magnitude, i.e. by the threadgroup size, here (32x32) = 1024.

Anybody who can explain what I am doing wrong here? Thanks

EDIT 2019-12-22: Following Matthijs' answer, I have changed the local histogram to use atomic operations as well, like this:

threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};

half gray  = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);

However, the result still is not the same as in the reference implementation above. There must be another severe conceptual bug?

Upvotes: 2

Views: 802

Answers (2)

middle

Reputation: 61

I think the problem is with the initialization of the shared memory; I don't think this definition does the job. Also, threadgroup-level memory synchronization is required between zeroing the shared memory and the atomic updates.

As for the device memory update, doing it with a single thread is clearly suboptimal. Updating the whole 256-entry histogram from each threadgroup can have a huge overhead, depending on the threadgroup size.

A sample I used for a small (16-element) histogram with 8x8 threadgroups:

kernel void gaussian_filter(device const uchar* data,
                            device atomic_uint* p_hist,
                            uint2 imageShape [[threads_per_grid]],
                            uint2 idx [[thread_position_in_grid]],
                            uint tidx [[thread_index_in_threadgroup]])
{
    // Zero the threadgroup histogram using the first 16 threads of the group,
    // then make the zeroed values visible to all threads before accumulating.
    threadgroup atomic_uint sh_hist[16];
    if (tidx < 16)
        atomic_store_explicit(sh_hist + tidx, 0, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Accumulate this thread's pixel into the threadgroup histogram.
    uint histBin = (uint)data[imageShape[0]*idx[1] + idx[0]]/16;
    atomic_fetch_add_explicit(sh_hist + histBin, 1, memory_order_relaxed);
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Flush the threadgroup histogram to device memory, one bin per thread.
    if (tidx < 16)
        atomic_fetch_add_explicit(p_hist + tidx, atomic_load_explicit(sh_hist + tidx, memory_order_relaxed), memory_order_relaxed);
}

Upvotes: 0

Matthijs Hollemans

Reputation: 7892

You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.
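For the 256-bin kernel from the question, a minimal sketch (untested) of what that could look like: it uses threadgroup atomics, zeroes the local histogram explicitly, keeps out-of-bounds threads alive so every thread reaches the barriers, and spreads the final flush across the threads of the group, along the lines of the other answer. The kernel and variable names just mirror the question.

#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];

kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
                  device HistoBuffer &histo [[ buffer(0) ]],
                  uint2 t_pos_grid [[ thread_position_in_grid ]],
                  uint  t_idx_tg   [[ thread_index_in_threadgroup ]],
                  uint2 t_per_tg   [[ threads_per_threadgroup ]])
{
  threadgroup atomic_uint localhisto[CHANNEL_SIZE];

  // Zero the threadgroup histogram cooperatively instead of relying on an initializer.
  uint threadsPerGroup = t_per_tg.x * t_per_tg.y;
  for (uint i = t_idx_tg; i < CHANNEL_SIZE; i += threadsPerGroup) {
    atomic_store_explicit(&localhisto[i], 0u, memory_order_relaxed);
  }
  threadgroup_barrier(mem_flags::mem_threadgroup);

  // Accumulate into the local histogram. Guard the read instead of returning
  // early, so out-of-bounds threads still participate in the barriers.
  if (t_pos_grid.x < sourceTexture.get_width() && t_pos_grid.y < sourceTexture.get_height()) {
    half gray = sourceTexture.read(t_pos_grid).r;
    uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
    atomic_fetch_add_explicit(&localhisto[grayvalue], 1u, memory_order_relaxed);
  }
  threadgroup_barrier(mem_flags::mem_threadgroup);

  // Flush to the global histogram, one bin per thread per loop iteration.
  for (uint i = t_idx_tg; i < CHANNEL_SIZE; i += threadsPerGroup) {
    uint count = atomic_load_explicit(&localhisto[i], memory_order_relaxed);
    if (count > 0) {
      atomic_fetch_add_explicit(&histo[i], count, memory_order_relaxed);
    }
  }
}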

Upvotes: 3
