I'm trying to downscale an image by a factor of 2 using bilinear interpolation, i.e. averaging each 2x2 block of pixels. My plain C++ implementation turned out to be absurdly slow. Since I'm on a POWER8, I decided to use AltiVec SIMD to accelerate the algorithm, but I couldn't find an instruction that reads all 4 pixels at the same time.
Also, here are some notes that might be helpful:
- The `Canvas` array is a `uint8*` which holds the cell states.
- `Center` is just the canvas size divided by 2.

So this is the plain C++ implementation:
auto C00 = Palette[Canvas[FLATTEN_2D(X * 2 + 0, Y * 2 + 0, GridSize.X)]].RGBA;
auto C01 = Palette[Canvas[FLATTEN_2D(X * 2 + 0, Y * 2 + 1, GridSize.X)]].RGBA;
auto C10 = Palette[Canvas[FLATTEN_2D(X * 2 + 1, Y * 2 + 0, GridSize.X)]].RGBA;
auto C11 = Palette[Canvas[FLATTEN_2D(X * 2 + 1, Y * 2 + 1, GridSize.X)]].RGBA;
size_t Index = FLATTEN_2D(X, Y, Center.X) * 3;
NewImage[Index + 0] = uint8_t(((float)C00.R + (float)C01.R + (float)C10.R + (float)C11.R) / 4.f);
NewImage[Index + 1] = uint8_t(((float)C00.G + (float)C01.G + (float)C10.G + (float)C11.G) / 4.f);
NewImage[Index + 2] = uint8_t(((float)C00.B + (float)C01.B + (float)C10.B + (float)C11.B) / 4.f);
(Yes, I am aware that my code looks horrible and that it could be optimized without using AltiVec, but that wouldn't be fun.)
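For reference, here is a minimal self-contained sketch of the same 2x2 average written with integer arithmetic only. Several things in it are assumptions, since they are not shown above: `Color` stands in for whatever `.RGBA` returns, `Flatten2D` is a hypothetical row-major replacement for `FLATTEN_2D` (the real macro may order its arguments differently), and the canvas width and height are assumed to be even.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the palette entry type (not shown in the question).
struct Color { uint8_t R, G, B, A; };

// Assumed row-major layout; the real FLATTEN_2D macro may differ.
inline size_t Flatten2D(size_t X, size_t Y, size_t Width) {
    return X + Y * Width;
}

// Downscale a palettized Width x Height canvas by 2 in each dimension and
// return tightly packed RGB bytes. Width and Height are assumed to be even.
std::vector<uint8_t> DownscaleHalf(const uint8_t* Canvas, const Color* Palette,
                                   size_t Width, size_t Height) {
    const size_t HalfW = Width / 2;
    const size_t HalfH = Height / 2;
    std::vector<uint8_t> NewImage(HalfW * HalfH * 3);

    for (size_t Y = 0; Y < HalfH; ++Y) {
        for (size_t X = 0; X < HalfW; ++X) {
            // Accumulate the four source pixels of this 2x2 block.
            unsigned R = 0, G = 0, B = 0;
            for (size_t DY = 0; DY < 2; ++DY) {
                for (size_t DX = 0; DX < 2; ++DX) {
                    const size_t Cell = Flatten2D(X * 2 + DX, Y * 2 + DY, Width);
                    const Color& C = Palette[Canvas[Cell]];
                    R += C.R;
                    G += C.G;
                    B += C.B;
                }
            }
            // Integer division truncates, just like the float-to-uint8_t cast.
            const size_t Index = Flatten2D(X, Y, HalfW) * 3;
            NewImage[Index + 0] = uint8_t(R / 4);
            NewImage[Index + 1] = uint8_t(G / 4);
            NewImage[Index + 2] = uint8_t(B / 4);
        }
    }
    return NewImage;
}
```

The sums fit comfortably in an `unsigned` (at most 4 * 255 = 1020), so dropping the float conversions should not change the output.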
And here's the half-done implementation that uses AltiVec:
#include <cstdint>   // int8_t, uint8_t, ... used in the typedefs below
#include <altivec.h>
typedef __vector int8_t int8x16_p;
typedef __vector uint8_t uint8x16_p;
typedef __vector int16_t int16x8_p;
typedef __vector uint16_t uint16x8_p;
typedef __vector int32_t int32x4_p;
typedef __vector uint32_t uint32x4_p;
typedef __vector float fp32x4_p;
typedef __vector double fp64x2_p;
// ...
int32x4_p IndexesX = vec_add(vec_splats((int32_t)X * 2), (int32x4_p){ 0, 0, 1, 1 });
int32x4_p IndexesY = vec_add(vec_splats((int32_t)Y * 2), (int32x4_p){ 0, 1, 0, 1 });
int32x4_p Indexes = vec_add(
vec_mul(vec_splats((int32_t)GridSize.X), IndexesX),
IndexesY
);
// Somehow load 4 uint8 elements in the Canvas array
// using the first 4 Canvas + Index integers as the memory location
uint8x16_p States = ???(Indexes, (uint8_t*)Canvas);
// And then somehow index each value in States (up to the 4th item) to Palette and load as 4 uint8 values (R, G, B, A)
// maybe we could load all the values into a single vector?
uint8x16_p ColorValues = ???(States, (uint32_t*)PaletteData);
// or maybe not?
uint8x16_p C00 = ???(vec_extract(States, 0), (uint32_t*)PaletteData);
uint8x16_p C01 = ???(vec_extract(States, 1), (uint32_t*)PaletteData);
uint8x16_p C10 = ???(vec_extract(States, 2), (uint32_t*)PaletteData);
uint8x16_p C11 = ???(vec_extract(States, 3), (uint32_t*)PaletteData);
// and finally somehow average all vectors
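To make the last step concrete, here is a hedged sketch of how the averaging alone might look, assuming the four RGBA palette entries have already been fetched with ordinary scalar loads into 16 contiguous bytes (as far as I can tell, AltiVec/VSX on POWER8 has no gather load, so the two `???` lookups probably stay scalar). `AverageFourPixels` is a hypothetical helper name. It uses `vec_perm` to regroup the bytes by channel, `vec_sum4s` to add each group of four bytes, and `vec_sr` to divide by 4. The scalar branch is there only so the sketch also builds on machines without AltiVec. I have not benchmarked this.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ALTIVEC__)
#include <altivec.h>
// altivec.h may define these as macros, which clashes with ordinary C++ code.
#undef vector
#undef pixel
#undef bool
#endif

// Hypothetical helper: averages four RGBA pixels stored back to back
// (R0 G0 B0 A0 R1 G1 B1 A1 ...) and writes the truncated means to Out[0..3].
inline void AverageFourPixels(const uint8_t* Pixels, uint8_t* Out) {
#if defined(__ALTIVEC__)
    typedef __vector unsigned char u8x16;
    typedef __vector unsigned int u32x4;

    // vec_ld needs a 16-byte aligned address, so stage the input first.
    alignas(16) uint8_t In[16];
    std::memcpy(In, Pixels, 16);
    const u8x16 V = vec_ld(0, In);

    // Regroup bytes by channel: R0 R1 R2 R3 | G0..G3 | B0..B3 | A0..A3.
    const u8x16 ByChannel = {0, 4, 8, 12, 1, 5, 9, 13,
                             2, 6, 10, 14, 3, 7, 11, 15};
    const u8x16 Grouped = vec_perm(V, V, ByChannel);

    // vec_sum4s adds each group of four bytes into one 32-bit lane.
    const u32x4 Sums = vec_sum4s(Grouped, vec_splats(0u));
    // Shift right by 2 to divide each channel sum by 4.
    const u32x4 Means = vec_sr(Sums, vec_splats(2u));

    alignas(16) uint32_t Lanes[4];
    vec_st(Means, 0, Lanes);
    for (int I = 0; I < 4; ++I) {
        Out[I] = uint8_t(Lanes[I]);
    }
#else
    // Portable fallback with the same result: per-channel sum, then divide.
    for (int Channel = 0; Channel < 4; ++Channel) {
        unsigned Sum = 0;
        for (int Pixel = 0; Pixel < 4; ++Pixel) {
            Sum += Pixels[Pixel * 4 + Channel];
        }
        Out[Channel] = uint8_t(Sum / 4);
    }
#endif
}
```

Summing first and shifting once keeps the truncating behaviour of the scalar code. Chaining two `vec_avg` calls would be shorter, but `vec_avg` rounds up at each step, so the results could differ by one.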
I'd also be very thankful if anyone has some updated resources on the AltiVec intrinsics. So far here's what I have found (just in case someone finds this in the future):
https://www.ibm.com/docs/en/xl-c-aix/13.1.2?topic=functions-vector-built-in
https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.0?topic=functions-vector-built-in
https://www.nxp.com/docs/en/reference-manual/ALTIVECPIM.pdf