Reputation: 43
I'm trying to transform a texture to the frequency domain via a compute shader in Unity/CG/HLSL, i.e. I'm trying to read pixel values from a texture and output an array of basis function coefficients. How would I go about that? I am really new to compute shaders, so I'm a bit lost. I understand the reason for the race condition and how compute shaders divide the workload, but is there any way to deal with this? In general, the documentation on buffers and related topics seems a little underwhelming for someone without a background in the matter.
The error I am getting:
Shader error in 'Compute.compute': race condition writing to shared resource detected, consider making this write conditional. at kernel testBuffer at Compute.compute(xxx) (on d3d11)
A simplified example would be to sum all the pixel values; my current approach is as follows. I am attempting to use StructuredBuffers, since I don't know how else I would be able to retrieve the data, or store it on the GPU for global shader access afterwards.
struct valueStruct {
    float4 values[someSize];
};
RWStructuredBuffer<valueStruct> valueBuffer;
// same behaviour if using RWStructuredBuffer<float3> valueBuffer;
// if using 'StructuredBuffer<float3> valueBuffer;' i get the error:
// Shader error in 'Compute.compute': l-value specifies const object at kernel testBuffer at Compute.compute(xxx) (on d3d11)
Texture2D<float4> Source;
[numthreads(8, 8, 1)]
void testBuffer(uint3 id : SV_DispatchThreadID) {
    valueBuffer[0].values[0] += Source[id.xy]; // in theory the values
    valueBuffer[0].values[1] += Source[id.xy]; // would be different,
    valueBuffer[0].values[2] += Source[id.xy]; // but it doesn't really
    valueBuffer[0].values[3] += Source[id.xy]; // matter for this, so
    valueBuffer[0].values[4] += Source[id.xy]; // they are just Source[id.xy]
//.....
}
The whole thing does not throw a race condition error if I unfold the buffer into single values like
float3 value0;
float3 value1;
float3 value2;
float3 value3;
float3 value4;
float3 value5;
float3 value6;
float3 value7;
float3 value8;
[numthreads(8, 8, 1)]
void testBuffer(uint3 id : SV_DispatchThreadID) {
    value0 += Source[id.xy]; // in theory the values
    value1 += Source[id.xy]; // would be different,
    value2 += Source[id.xy]; // but it doesn't really
    value3 += Source[id.xy]; // matter for this, so
    value4 += Source[id.xy]; // they are just Source[id.xy]
}
and don't use a StructuredBuffer, but in that case I don't know how to retrieve the data after kernel dispatch. If the problem is down to the READ part of the RWStructuredBuffer I am using, what would be an equivalent buffer that I can only write to, since I don't really read the data? Or does the "+=" operator already cause a race condition no matter what?
From Google I found that a solution might be using GroupMemoryBarrierWithGroupSync(), but I have no idea what this is (not to mention how it works), and in general the Google results are just flying a little over my head at the moment.
Could anyone provide an example of how to solve this issue? Otherwise, I appreciate any pointers.
Upvotes: 4
Views: 4032
Reputation: 1329
First of all, a race condition happens whenever one thread writes to a memory location while another thread reads OR writes from/to that same location. So, yes, += is already causing a race condition, and there is no "easy" way to fix this. (Btw: += implicitly reads the value, because you can't calculate the sum of two values without knowing them.)
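(As a side note: if integer precision is acceptable for your use case, atomic intrinsics sidestep the race condition entirely, because the hardware serializes the read-modify-write. HLSL's InterlockedAdd only works on int/uint, not float, so you would have to quantize your pixel values first. A minimal sketch, where the buffer name, kernel name, and the fixed-point scale of 256 are arbitrary choices for illustration:

```hlsl
Texture2D<float4> Source;
RWStructuredBuffer<uint> SumBuffer; // initialized to 0 from script before dispatch

[numthreads(8, 8, 1)]
void atomicSum(uint3 id : SV_DispatchThreadID)
{
    // Quantize to fixed point, since InterlockedAdd does not support floats
    uint value = (uint)(Source[id.xy].r * 256);
    InterlockedAdd(SumBuffer[0], value); // atomic read-modify-write, no race condition
}
```

This is simpler than a reduction, but all threads contend for the same memory location, so for large images the parallel reduction below is usually the better approach.)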
GroupMemoryBarrierWithGroupSync() inserts a memory barrier, which means: the current thread stops at that line until all threads in the current group have reached that line. This is important if one thread writes to a memory location and another thread needs to read from that location afterwards. So on its own it doesn't help you at all (but it is required in the following algorithm).
Now, a common solution for calculating the sum of all pixels (or anything similar) is to calculate the sum in parallel. The idea is that every thread reads two pixels and writes their sum to its own index in a groupshared array (notice: no race conditions happen here, since every thread has its own memory location to write to; no two threads write to the same location). Then half of the threads each read two values out of this array and write their sum back, and so on, until only one value remains. At that point we have calculated the sum of all pixels in the area that this group covers (in your case, we have summed 8x8 = 64 pixels). Now, one thread in that group (for example the one whose SV_GroupIndex is zero, which is only true for the first thread in every group) writes that sum back to an RWStructuredBuffer at an index that is specific to that thread group (so, again, no race condition happens). This process is then repeated until all values are summed.
As a more in-depth explanation of the algorithm, see NVIDIA's Parallel Reduction Whitepaper (notice that their code is in CUDA, so while it works very similarly in HLSL, syntax and function names may differ).
Now, this is all just about calculating the sum of all pixels. Calculating the frequency domain may be a bit more complex, or may even require a different solution altogether, since the total size of groupshared memory per group is limited (16 KB on DX10 hardware).
Edit:
Small sample code in HLSL (this assumes that the image was already loaded into a linear StructuredBuffer), calculating the sum of 128 consecutive pixels per group:
StructuredBuffer<float> Source : register(t0);
RWStructuredBuffer<float> Destination : register(u0);

groupshared float TotalSum[64];

[numthreads(64, 1, 1)]
void mainCS(uint3 groupID : SV_GroupID, uint3 dispatchID : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
    // Each of the 64 threads loads two adjacent pixels and stores
    // their sum at its own index (so no two threads write the same slot).
    uint p = dispatchID.x * 2;
    float l = Source.Load(p);
    float r = Source.Load(p + 1);
    TotalSum[groupIndex] = l + r;
    // Wait until every thread in the group has written its value.
    GroupMemoryBarrierWithGroupSync();
    // Tree reduction: each iteration halves the number of active threads.
    for (uint k = 32; k > 0; k >>= 1)
    {
        if (groupIndex < k)
        {
            TotalSum[groupIndex] += TotalSum[groupIndex + k];
        }
        GroupMemoryBarrierWithGroupSync();
    }
    // Exactly one thread writes the group's total to a group-unique index.
    if (groupIndex == 0) { Destination[groupID.x] = TotalSum[0]; }
}
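To address the "how do I retrieve the data after dispatch" part of your question: on the Unity/C# side you read a ComputeBuffer back to the CPU with GetData(). A rough sketch for the sample kernel above (the class name, pixel count, and the way the source data is filled are illustrative assumptions; also note that GetData() stalls the CPU until the GPU has finished):

```csharp
using UnityEngine;

public class SumDispatcher : MonoBehaviour
{
    public ComputeShader shader; // assign the .compute asset in the inspector

    void Start()
    {
        const int pixelCount = 1024;             // must be a multiple of 128 here
        const int groupCount = pixelCount / 128; // each group sums 128 pixels

        var source = new ComputeBuffer(pixelCount, sizeof(float));
        var destination = new ComputeBuffer(groupCount, sizeof(float));
        source.SetData(new float[pixelCount]);   // upload your pixel data here

        int kernel = shader.FindKernel("mainCS");
        shader.SetBuffer(kernel, "Source", source);
        shader.SetBuffer(kernel, "Destination", destination);
        shader.Dispatch(kernel, groupCount, 1, 1);

        // Read the per-group sums back to the CPU (blocks until the GPU is done)
        var sums = new float[groupCount];
        destination.GetData(sums);

        source.Release();
        destination.Release();
    }
}
```

You can then either sum the per-group results on the CPU (if groupCount is small, that's cheap), or dispatch the same reduction kernel again on the destination buffer until only one value remains.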
Upvotes: 7