Reputation: 83
I've got a large tightly packed array of 12-bit integers in the following repeating bit-packing pattern: (where n in An/Bn represents bit number and A and B are the first two 12-bit integers in the array)
| byte0 | byte1 | byte2 | etc..
| A11 A10 A9 A8 A7 A6 A5 A4 | B11 B10 B9 B8 B7 B6 B5 B4 | B3 B2 B1 B0 A3 A2 A1 A0 | etc..
which I'm bit reordering into the following pattern:
| byte0 | byte1 | byte2 | etc..
| A11 A10 A9 A8 A7 A6 A5 A4 | A3 A2 A1 A0 B11 B10 B9 B8 | B7 B6 B5 B4 B3 B2 B1 B0 | etc..
I have got it working in a per 3-byte loop with the following code:
void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
while (pCSI2 < pCSI2LineEnd) {
pBE[0] = pCSI2[0];
pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);
// Go to next 12-bit pixel pair (3 bytes)
pCSI2 += 3;
pBE += 3;
}
}
but working with byte granularity isn't great for performance. The target CPU is a 64-bit ARM Cortex-A72 (Raspberry Pi Compute Module 4). For context, this code converts MIPI CSI-2 bit-packed raw image data to Adobe DNG's bit-packing.
I'm hoping I can get a considerable performance improvement using SIMD intrinsics but I'm not really sure where to start. I've got the SIMDe header to translate intrinsics so AVX/AVX2 solutions are welcome.
Upvotes: 3
Views: 352
Reputation: 58695
The NEON ld3
instruction is ideal for this; it loads 48 bytes and unzips them into three NEON registers. Then you just need a couple of shifts and ORs.
I came up with the following:
void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
while (pCSI2 < pCSI2LineEnd) {
uint8x16x3_t in = vld3q_u8(pCSI2);
uint8x16x3_t out;
out.val[0] = in.val[0];
out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
vst3q_u8(pBE, out);
pCSI2 += 48;
pBE += 48;
}
}
With gcc, the generated assembly looks like what you would expect. (There is one mov
that could be eliminated with better register allocation, but that's pretty minor.)
Unfortunately clang has what looks like a bizarre missed optimization, where it breaks the 4-bit right shift into a 3-bit and a 1-bit shift. I filed a bug.
In principle we can do a little better using sli
, Shift Left and Insert, to effectively merge the OR with one of the shifts:
out.val[1] = vsliq_n_u8(vshrq_n_u8(in.val[1], 4), in.val[2], 4);
out.val[2] = vsliq_n_u8(vshrq_n_u8(in.val[2], 4), in.val[1], 4);
But since it overwrites its source operand, we pay for it with a couple extra mov
s. https://godbolt.org/z/TbzEEd1Pn. clang allocates registers more cleverly and only needs one mov
.
Another option, which could be slightly faster, is to use sra
, Shift Right and Accumulate, which does an add instead of an insert. Since the relevant bits are already zero here, this has the same effect. Oddly there is no sla
.
out.val[1] = vsraq_n_u8(vshlq_n_u8(in.val[2], 4), in.val[1], 4);
out.val[2] = vsraq_n_u8(vshlq_n_u8(in.val[1], 4), in.val[2], 4);
Upvotes: 5
Reputation: 28300
I suggest you start with a diagram.
I can't say about NEON, so I'll describe how I would make AVX2 code which does what you want (however, you should implement it with your target instruction set; better don't bother with converters, if your goal is to make new code). x64 intrinsics have great documentation; here is an example which I use.
AVX2 registers have 256 bits, or 32 bytes. That is, 10 units of your 24-bit data. Make a diagram (on paper would be best for me): draw which bits would a 256-bit register contain if you read it from memory. Then draw which bits you want to get in it after your transformation. Connect them with lines. Identify blocks of bits which have identical relative positions.
Then write code which isolates relevant blocks of bits (_mm256_and_si256
), shifts them around (_mm256_slli_si256
, possibly _mm256_bslli_epi128
or others) and combines them (_mm256_or_si256
). AVX2 is particularly idiosyncratic about shifts, so I am sure NEON code will be easier to write.
Your main loop should probably contain reading, processing and writing 3 registers, or 768 bits. If you make a diagram for just the first one, you might be able to implement the other two similarly. Of course, you need special treatment for loop leftovers (the last few data elements) — use regular C code for them.
Upvotes: 0