rafa_br34

Reputation: 13

Loading 4 uint8_t elements from different memory locations using AltiVec

I'm trying to downscale an image using bilinear interpolation. I wrote a plain C++ implementation, but it turned out to be absurdly slow. Since I'm on a POWER8, I decided to try AltiVec SIMD to speed the algorithm up, but I couldn't find an instruction that reads all 4 pixels at the same time.
Also, here are some notes that might be helpful:

  1. Canvas is a uint8_t* that holds the cell states (each state is an index into Palette)
  2. Center is the canvas size divided by 2

This is the plain C++ implementation:

auto C00 = Palette[Canvas[FLATTEN_2D(X * 2 + 0, Y * 2 + 0, GridSize.X)]].RGBA;
auto C01 = Palette[Canvas[FLATTEN_2D(X * 2 + 0, Y * 2 + 1, GridSize.X)]].RGBA;
auto C10 = Palette[Canvas[FLATTEN_2D(X * 2 + 1, Y * 2 + 0, GridSize.X)]].RGBA;
auto C11 = Palette[Canvas[FLATTEN_2D(X * 2 + 1, Y * 2 + 1, GridSize.X)]].RGBA;

size_t Index = FLATTEN_2D(X, Y, Center.X) * 3;

NewImage[Index + 0] = uint8_t(((float)C00.R + (float)C01.R + (float)C10.R + (float)C11.R) / 4.f);
NewImage[Index + 1] = uint8_t(((float)C00.G + (float)C01.G + (float)C10.G + (float)C11.G) / 4.f);
NewImage[Index + 2] = uint8_t(((float)C00.B + (float)C01.B + (float)C10.B + (float)C11.B) / 4.f);

(Yes, I'm aware that my code looks horrible and could be optimized without AltiVec, but that wouldn't be fun.)
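For comparison, here is a minimal sketch of what an integer-only scalar version might look like. `Color`, `Flatten2D` and `DownscaleHalf` are hypothetical stand-ins for my real types and the `FLATTEN_2D` macro, and the sketch assumes a row-major layout (`x + y * width`). Shifting the sum right by 2 truncates exactly like the float division followed by the `uint8_t` cast above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the palette entry type used in the question.
struct Color { uint8_t R, G, B, A; };

// Assumed row-major flattening; the real FLATTEN_2D macro is not shown here.
static inline size_t Flatten2D(size_t X, size_t Y, size_t Width) { return X + Y * Width; }

// Averages every 2x2 block of palette-indexed cells into one RGB output pixel.
// NewImage must hold (GridWidth / 2) * (GridHeight / 2) * 3 bytes.
void DownscaleHalf(const uint8_t* Canvas, const Color* Palette,
                   size_t GridWidth, size_t GridHeight, uint8_t* NewImage)
{
    const size_t OutWidth = GridWidth / 2;
    const size_t OutHeight = GridHeight / 2;

    for (size_t Y = 0; Y < OutHeight; ++Y) {
        // Hoist the two source row pointers out of the inner loop.
        const uint8_t* Row0 = Canvas + Flatten2D(0, Y * 2, GridWidth);
        const uint8_t* Row1 = Row0 + GridWidth;

        for (size_t X = 0; X < OutWidth; ++X) {
            const Color& C00 = Palette[Row0[X * 2 + 0]];
            const Color& C10 = Palette[Row0[X * 2 + 1]];
            const Color& C01 = Palette[Row1[X * 2 + 0]];
            const Color& C11 = Palette[Row1[X * 2 + 1]];

            uint8_t* Out = NewImage + Flatten2D(X, Y, OutWidth) * 3;
            // Operands promote to int, so the sum (at most 1020) cannot overflow.
            Out[0] = uint8_t((C00.R + C01.R + C10.R + C11.R) >> 2);
            Out[1] = uint8_t((C00.G + C01.G + C10.G + C11.G) >> 2);
            Out[2] = uint8_t((C00.B + C01.B + C10.B + C11.B) >> 2);
        }
    }
}
```

Most of the remaining cost is probably the palette lookups, which is the part I'm hoping SIMD can help with.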

And here's the half-done implementation that uses AltiVec:

#include <altivec.h>
#include <cstdint> // int8_t, uint8_t, int32_t, ...

typedef __vector int8_t int8x16_p;
typedef __vector uint8_t uint8x16_p;
typedef __vector int16_t int16x8_p;
typedef __vector uint16_t uint16x8_p;
typedef __vector int32_t int32x4_p;
typedef __vector uint32_t uint32x4_p;
typedef __vector float fp32x4_p;
typedef __vector double fp64x2_p;

// ...

int32x4_p IndexesX = vec_add(vec_splats((int32_t)X * 2), (int32x4_p){ 0, 0, 1, 1 });
int32x4_p IndexesY = vec_add(vec_splats((int32_t)Y * 2), (int32x4_p){ 0, 1, 0, 1 });
                    
// Flat index = X + Y * width, assuming FLATTEN_2D(x, y, w) is row-major (x + y * w)
int32x4_p Indexes = vec_add(
    vec_mul(vec_splats((int32_t)GridSize.X), IndexesY),
    IndexesX
);


// Somehow load 4 uint8 elements in the Canvas array
// using the first 4 Canvas + Index integers as the memory location
uint8x16_p States = ???(Indexes, (uint8_t*)Canvas);

// And then somehow index each value in States (up to the 4th item) to Palette and load as 4 uint8 values (R, G, B, A)

// maybe we could load all the values into a single vector?
uint8x16_p ColorValues = ???(States, (uint32_t*)PaletteData);
// or maybe not?
uint8x16_p C00 = ???(vec_extract(States, 0), (uint32_t*)PaletteData);
uint8x16_p C01 = ???(vec_extract(States, 1), (uint32_t*)PaletteData);
uint8x16_p C10 = ???(vec_extract(States, 2), (uint32_t*)PaletteData);
uint8x16_p C11 = ???(vec_extract(States, 3), (uint32_t*)PaletteData);

// and finally somehow average all vectors
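One direction that might work, as far as I can tell: AltiVec/VSX on POWER8 does not seem to have a gather-load instruction, so the four lookups probably have to stay scalar, and only the averaging is vectorized. The sketch below copies the four RGBA entries into one aligned 16-byte buffer, transposes it with `vec_perm` so each 32-bit lane holds one channel, sums each lane with `vec_sum4s`, and divides by 4 with `vec_sr`. `Average4` and `PaletteBytes` are hypothetical names, the palette is assumed to be tightly packed 4-byte R, G, B, A entries, and the `#else` branch is a plain scalar fallback so the code also builds on machines without AltiVec.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ALTIVEC__)
#include <altivec.h>
// altivec.h may define these as macros, which can clash with normal C++ code.
#undef vector
#undef pixel
#undef bool
#endif

// Hypothetical helper: averages the four palette colors selected by
// Canvas[Indexes[0..3]] and writes the truncated R, G, B averages to OutRGB.
static inline void Average4(const uint8_t* Canvas, const uint32_t Indexes[4],
                            const uint8_t* PaletteBytes, uint8_t OutRGB[3])
{
    // "Gather" step: four scalar lookups packed into one 16-byte buffer
    // laid out as RGBA RGBA RGBA RGBA.
    alignas(16) uint8_t Pixels[16];
    for (int I = 0; I < 4; ++I)
        std::memcpy(Pixels + I * 4, PaletteBytes + size_t(Canvas[Indexes[I]]) * 4, 4);

    alignas(16) unsigned int Sums[4];
#if defined(__ALTIVEC__)
    // Reorder to RRRR GGGG BBBB AAAA so each 32-bit lane holds one channel.
    const __vector unsigned char Transpose =
        { 0, 4, 8, 12,  1, 5, 9, 13,  2, 6, 10, 14,  3, 7, 11, 15 };
    __vector unsigned char V = vec_ld(0, Pixels);
    V = vec_perm(V, V, Transpose);
    // Sum the four bytes of each lane, then divide by 4 (truncating).
    __vector unsigned int S = vec_sum4s(V, vec_splats(0u));
    S = vec_sr(S, vec_splats(2u));
    vec_st(S, 0, Sums);
#else
    // Scalar fallback with the same truncating result.
    for (int C = 0; C < 4; ++C)
        Sums[C] = (Pixels[C] + Pixels[4 + C] + Pixels[8 + C] + Pixels[12 + C]) >> 2u;
#endif

    OutRGB[0] = uint8_t(Sums[0]);
    OutRGB[1] = uint8_t(Sums[1]);
    OutRGB[2] = uint8_t(Sums[2]);
}
```

Two nested `vec_avg` calls would be shorter, but `vec_avg` rounds up at each step, so the result can differ from the truncating scalar code. With only one output pixel per call the vector part is tiny, so I'm not sure this beats the scalar version unless several output pixels are processed per iteration.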

I'd also be very thankful for any up-to-date resources on the AltiVec intrinsics. Here's what I have found so far (in case someone finds this in the future):
https://www.ibm.com/docs/en/xl-c-aix/13.1.2?topic=functions-vector-built-in
https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.0?topic=functions-vector-built-in
https://www.nxp.com/docs/en/reference-manual/ALTIVECPIM.pdf

