Reputation: 131
I am trying to implement a high dynamic range algorithm on the GPU, for which I need to calculate a histogram. The Metal code looks like this so far:
kernel void
hist(texture2d_array<half, access::read> inArray [[texture(0)]],
     device float *t [[buffer(0)]], // ignore this
     volatile device uint *histogram [[buffer(1)]],
     uint2 gid [[thread_position_in_grid]]){
    int4 Y_ldr;
    uint redChannel;
    for(uint i = 0; i < inArray.get_array_size(); i++){
        Y_ldr = int4(inArray.read(gid, i, 0) * 255);
        redChannel = Y_ldr.r;
        histogram[redChannel]++;
    }
}
The kernel fills half of the histogram (256 entries) with huge numbers; the other half stays at its initial values (zero). Instead, when I write
histogram[0] = 1; // just a number
histogram[0] = redChannel; // OR this
I get the correct number at position 0 in both cases. Using atomic_uint doesn't help, and neither does a threadgroup barrier. Trying
histogram[0]++;
reveals that Metal does not handle data hazards automatically, but the numbers are small, around 12000. So what is causing the trouble?
In case you need to know how the pipeline state is set up, see here:
var threadGroupCount = MTLSizeMake(8, 8, 1)
var threadgroups = MTLSizeMake(pictures!.width/threadGroupCount.width, pictures!.height/threadGroupCount.height, 1)
computeCommandEncoder.setComputePipelineState(hist!)
computeCommandEncoder.setTexture(pictures, atIndex: 0)
computeCommandEncoder.setBuffer(exposure_times, offset: 0, atIndex: 0)
computeCommandEncoder.setBuffer(histogram, offset: 0, atIndex: 1) // <-- this is important!!!
computeCommandEncoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadGroupCount)
Upvotes: 3
Views: 820
Reputation: 131
jesus -.- The iPad CPU interprets UInt as a 64-bit number (8 bytes). To the GPU, uint means 32 bits (4 bytes). When the GPU writes to position [1], the CPU interprets that location as the upper 4 bytes of array entry [0]. That causes the insanely huge numbers, and it is also why only half of the CPU-side array gets filled: 256 four-byte counters cover only 128 eight-byte entries. To solve this issue, I have to set up the histogram as [UInt32] on the CPU side.
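To make the size mismatch concrete, here is a minimal sketch that reproduces the effect on the CPU alone, without Metal. The helper `simulateGPUWrite` is hypothetical and only mimics a kernel that sees the buffer as 256 32-bit counters; the sketch assumes a 64-bit, little-endian platform (as on the iPad in question) and a recent Swift toolchain, so the syntax is newer than the Swift used in the question.

```swift
// Hypothetical stand-in for the kernel: writes `value` into the first
// `binCount` 32-bit slots of the raw bytes, which is how
// `device uint *histogram` addresses the buffer on the GPU.
func simulateGPUWrite(_ bytes: UnsafeMutableRawBufferPointer,
                      binCount: Int,
                      value: UInt32) {
    for bin in 0..<binCount {
        bytes.storeBytes(of: value,
                         toByteOffset: bin * MemoryLayout<UInt32>.stride,
                         as: UInt32.self)
    }
}

let binCount = 256

// Wrong: UInt is 8 bytes per element on a 64-bit CPU, so two GPU counters
// land in every CPU entry and only the first half of the array is touched.
var wrongHistogram = [UInt](repeating: 0, count: binCount)
wrongHistogram.withUnsafeMutableBytes {
    simulateGPUWrite($0, binCount: binCount, value: 1)
}

// Right: UInt32 matches the GPU's 4-byte uint, one counter per entry.
var rightHistogram = [UInt32](repeating: 0, count: binCount)
rightHistogram.withUnsafeMutableBytes {
    simulateGPUWrite($0, binCount: binCount, value: 1)
}

print(wrongHistogram[0])            // 4294967297, i.e. 1 + (1 << 32), on 64-bit little-endian
print(wrongHistogram[binCount / 2]) // 0, the second half is never written
print(rightHistogram[0])            // 1
```

With the `[UInt32]` layout, the byte length of the buffer handed to Metal would presumably be `binCount * MemoryLayout<UInt32>.stride` (1024 bytes for 256 bins) rather than twice that.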
Upvotes: 3