Reputation: 43
I'm trying to transform a texture to the frequency domain via a compute shader in Unity/CG/HLSL, i.e. I'm trying to read pixel values from a texture and output an array of basis function coefficients. How would I go about that? I am really new to compute shaders, so I'm a bit lost. I understand the reason for the race condition and how compute shaders divide the workload, but is there any way to deal with this? In general, the documentation on buffers and related topics seems a little underwhelming for someone without a background in the matter.
The error I am getting:
Shader error in 'Compute.compute': race condition writing to shared resource detected, consider making this write conditional. at kernel testBuffer at Compute.compute(xxx) (on d3d11)
A simplified example would be to sum all the pixel values; my current approach is as follows. I am attempting to use StructuredBuffers, since I don't know how else I would be able to retrieve the data, or store it on the GPU for global shader access afterwards.
struct valueStruct {
    float4 values[someSize];
};
RWStructuredBuffer<valueStruct> valueBuffer;
// same behaviour if using RWStructuredBuffer<float3> valueBuffer;
// if using 'StructuredBuffer<float3> valueBuffer;' i get the error:
// Shader error in 'Compute.compute': l-value specifies const object at kernel testBuffer at Compute.compute(xxx) (on d3d11)
Texture2D<float4> Source;
[numthreads(8, 8, 1)]
void testBuffer(uint3 id : SV_DispatchThreadID) {
    valueBuffer[0].values[0] += Source[id.xy]; // in theory the values
    valueBuffer[0].values[1] += Source[id.xy]; // would be different,
    valueBuffer[0].values[2] += Source[id.xy]; // but it doesn't really
    valueBuffer[0].values[3] += Source[id.xy]; // matter for this, so
    valueBuffer[0].values[4] += Source[id.xy]; // they are just Source[id.xy]
//.....
}
The whole thing does not throw a race condition error if I unfold the buffer into single values like
float3 value0;
float3 value1;
float3 value2;
float3 value3;
float3 value4;
float3 value5;
float3 value6;
float3 value7;
float3 value8;
[numthreads(8, 8, 1)]
void testBuffer(uint3 id : SV_DispatchThreadID) {
    value0 += Source[id.xy]; // in theory the values
    value1 += Source[id.xy]; // would be different,
    value2 += Source[id.xy]; // but it doesn't really
    value3 += Source[id.xy]; // matter for this, so
    value4 += Source[id.xy]; // they are just Source[id.xy]
}
and don't use a StructuredBuffer, but in that case I don't know how to retrieve the data after kernel dispatch. If the problem is down to the READ part of the RWStructuredBuffer I am using, what would be an equivalent buffer that I can only write to, since I don't really read the data? Or does the "+=" operator already cause a race condition no matter what?
From Google I found that a solution might be using GroupMemoryBarrierWithGroupSync(), but I have no idea what this is (not to mention how it works), and in general the Google results are just flying a little over my head at the moment.
Could anyone provide an example of how to solve this issue? Otherwise, I appreciate any pointers.
Upvotes: 4
Views: 4032
Reputation: 1329
First of all, a race condition happens whenever one thread writes to a memory location while another thread reads OR writes from/to that same location. So, yes, += is already causing a race condition, and there is no "easy" way to fix this. (Btw: += implicitly reads the value, because you can't calculate the sum of two values without knowing them.)
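(As a side note: if integer precision is acceptable for your use case, atomic intrinsics sidestep the race condition entirely, because the hardware serializes the read-modify-write. HLSL's InterlockedAdd only works on int/uint, not float, so you would have to quantize your pixel values first. A minimal sketch, where the buffer name, kernel name, and the fixed-point scale of 256 are arbitrary choices for illustration:

```hlsl
Texture2D<float4> Source;
RWStructuredBuffer<uint> SumBuffer; // initialized to 0 from script before dispatch

[numthreads(8, 8, 1)]
void atomicSum(uint3 id : SV_DispatchThreadID)
{
    // Quantize to fixed point, since InterlockedAdd does not support floats
    uint value = (uint)(Source[id.xy].r * 256);
    InterlockedAdd(SumBuffer[0], value); // atomic read-modify-write, no race condition
}
```

This is simpler than a reduction, but all threads contend for the same memory location, so for large images the parallel reduction below is usually the better approach.)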
GroupMemoryBarrierWithGroupSync() inserts a memory barrier, which means: the current thread stops at that line until all threads in the current group have reached that line. This is important if one thread writes to a memory location and another thread needs to read from that location afterwards. So on its own it doesn't help you at all (but it is required in the following algorithm).
Now, a common solution for calculating the sum of all pixels (or anything similar) is to calculate the sum in parallel. The idea is that every thread reads two pixels and writes their sum to its own index in a groupshared array (notice: no race conditions happen here, since every thread has its own memory location to write to; no two threads write to the same location). Then half of the threads each read two values out of this array and write their sum back, and so on, until only one value remains. At that point we have calculated the sum of all pixels in the area that this group covers (in your case, we have summed 8x8 = 64 pixels). Now, one thread in that group (for example the one whose SV_GroupIndex is zero, which is only true for the first thread in every group) writes that sum back to an RWStructuredBuffer at an index that is specific to that thread group (so, again, no race condition happens). This process is then repeated until all values are summed.
As a more in-depth explanation of the algorithm, see NVIDIA's Parallel Reduction Whitepaper (notice that their code is in CUDA, so while it works very similarly in HLSL, syntax and function names may differ).
Now, this is all just about calculating the sum of all pixels. Calculating the frequency domain may be a bit more complex, or may even require a different solution altogether, since the total size of groupshared memory per group is limited (16 KB on DX10 hardware).
Edit:
Small sample code in HLSL (this assumes that the image was already loaded into a linear StructuredBuffer), calculating the sum of 128 consecutive pixels per group:
StructuredBuffer<float> Source : register(t0);
RWStructuredBuffer<float> Destination : register(u0);

groupshared float TotalSum[64];

[numthreads(64, 1, 1)]
void mainCS(uint3 groupID : SV_GroupID, uint3 dispatchID : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
    // Each of the 64 threads loads two adjacent pixels and stores
    // their sum at its own index (so no two threads write the same slot).
    uint p = dispatchID.x * 2;
    float l = Source.Load(p);
    float r = Source.Load(p + 1);
    TotalSum[groupIndex] = l + r;
    // Wait until every thread in the group has written its value.
    GroupMemoryBarrierWithGroupSync();
    // Tree reduction: each iteration halves the number of active threads.
    for (uint k = 32; k > 0; k >>= 1)
    {
        if (groupIndex < k)
        {
            TotalSum[groupIndex] += TotalSum[groupIndex + k];
        }
        GroupMemoryBarrierWithGroupSync();
    }
    // Exactly one thread writes the group's total to a group-unique index.
    if (groupIndex == 0) { Destination[groupID.x] = TotalSum[0]; }
}
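To address the "how do I retrieve the data after dispatch" part of your question: on the Unity/C# side you read a ComputeBuffer back to the CPU with GetData(). A rough sketch for the sample kernel above (the class name, pixel count, and the way the source data is filled are illustrative assumptions; also note that GetData() stalls the CPU until the GPU has finished):

```csharp
using UnityEngine;

public class SumDispatcher : MonoBehaviour
{
    public ComputeShader shader; // assign the .compute asset in the inspector

    void Start()
    {
        const int pixelCount = 1024;             // must be a multiple of 128 here
        const int groupCount = pixelCount / 128; // each group sums 128 pixels

        var source = new ComputeBuffer(pixelCount, sizeof(float));
        var destination = new ComputeBuffer(groupCount, sizeof(float));
        source.SetData(new float[pixelCount]);   // upload your pixel data here

        int kernel = shader.FindKernel("mainCS");
        shader.SetBuffer(kernel, "Source", source);
        shader.SetBuffer(kernel, "Destination", destination);
        shader.Dispatch(kernel, groupCount, 1, 1);

        // Read the per-group sums back to the CPU (blocks until the GPU is done)
        var sums = new float[groupCount];
        destination.GetData(sums);

        source.Release();
        destination.Release();
    }
}
```

You can then either sum the per-group results on the CPU (if groupCount is small, that's cheap), or dispatch the same reduction kernel again on the destination buffer until only one value remains.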
Upvotes: 7