Reputation: 43
I am writing a software rasterizer with heavy use of Intel intrinsics (NOT including AVX-512). Colors are represented by a 32-bit unsigned integer, which is really just 4 packed 8-bit channels (RGBA). So a vector of 8 colors may be held in a single __m256i color variable. However, I need to manipulate the individual channels within this vector by multiplying them by floats. In other words, I may have another vector of float/ps values, __m256 rLight, where I want to multiply the 8 unsigned bits of R in each element of the color vector by the corresponding float in rLight. I cannot find any sane way to do this. It seems like what I need to do is extract the 8 bytes of interest into an __m256 float vector, do the multiplication, then convert back to unsigned and put them back into the original vector, but I am struggling.
Any instructions that look promising would be appreciated.
Upvotes: 1
Views: 1863
Reputation: 21936
a vector of 8 colors may be held in a single __m256i color variable.
That's not the best way. It will be very hard to add 10+ bit color depth, gamma correction, or color grading later. For optimal performance, consider using 16-bit integers or floats instead.
I cannot find any sane way to do this.
1. Convert your floats to 15-bit or 16-bit fixed point. The fastest way to do that is to abuse the IEEE representation: use a single FMA instruction to scale and offset the floats so that the [0..1] range corresponds to the least significant 15-16 bits of the mantissa, then bitcast the floats to integers and subtract an int32 number bitwise equal to the float offset value. See how I did it for 64-bit doubles: https://github.com/Const-me/DtsDecoder/blob/7812fa32fbdc8b45e6b7dcd66aef1a58e104e089/libdcadec/interpolator_float.cpp#L135-L174 The same approach can be used for 32-bit floats; it's just 2 instructions for all 8 floats in the register, _mm256_fmadd_ps and _mm256_sub_epi32.
2. Duplicate lanes with _mm256_packus_epi32 while compressing 32 bits into 16. Note that this instruction uses saturation, so it clips automatically to [0 .. 0xFFFF], i.e. you don't have to waste CPU cycles on clipping.
3. Load the colors.
4. Now it's time to scale. Here's one way to do it:
inline __m256i scaleBytes( __m256i rgba, __m256i mul )
{
    // Split every 16-bit lane into its low and high byte
    __m256i low = _mm256_and_si256( rgba, _mm256_set1_epi16( 0xFF ) );
    __m256i high = _mm256_and_si256( rgba, _mm256_set1_epi16( 0xFF00 ) );
    // Keep the upper 16 bits of each 16x16 bit unsigned product
    low = _mm256_mulhi_epu16( low, mul );
    high = _mm256_mulhi_epu16( high, mul );
    // Drop the fractional bits that ended up in the low byte
    high = _mm256_and_si256( high, _mm256_set1_epi16( 0xFF00 ) );
    return _mm256_or_si256( low, high );
}
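For a quick sanity check, here is a minimal sketch that applies the function to a single packed color. scaleColor and the AVX2_FN macro are hypothetical helpers introduced only for this example, and the multiplier is assumed to be 0.16 fixed point, so 0x8000 means 0.5.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical macro: lets GCC/Clang compile these functions without -mavx2
#if defined( __GNUC__ ) || defined( __clang__ )
#define AVX2_FN __attribute__( ( target( "avx2" ) ) )
#else
#define AVX2_FN
#endif

// Same logic as above; 0xFF00 is cast to short only to silence narrowing warnings
AVX2_FN inline __m256i scaleBytes( __m256i rgba, __m256i mul )
{
    __m256i low = _mm256_and_si256( rgba, _mm256_set1_epi16( 0xFF ) );
    __m256i high = _mm256_and_si256( rgba, _mm256_set1_epi16( (short)0xFF00 ) );
    low = _mm256_mulhi_epu16( low, mul );
    high = _mm256_mulhi_epu16( high, mul );
    high = _mm256_and_si256( high, _mm256_set1_epi16( (short)0xFF00 ) );
    return _mm256_or_si256( low, high );
}

// Hypothetical scalar wrapper: scales all 4 channels of one color by the same multiplier
AVX2_FN uint32_t scaleColor( uint32_t rgba, uint16_t mul )
{
    const __m256i colors = _mm256_set1_epi32( (int)rgba );
    const __m256i muls = _mm256_set1_epi16( (short)mul );
    const __m256i scaled = scaleBytes( colors, muls );
    // All 8 lanes are identical here, return the lowest one
    return (uint32_t)_mm_cvtsi128_si32( _mm256_castsi256_si128( scaled ) );
}
```

With these definitions, scaleColor( 0xFF804020, 0x8000 ) returns 0x7F402010: every channel is halved and rounded down.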
If you want better rounding, you will need to adjust the scaleBytes code above. That version suffers from an off-by-one error: 0xFF * 0xFFFF = 0xFEFF01, i.e. you'll get 0xFE after multiplying 0xFF by a float of 1.0. A good fix is to use 1.15 fixed point for the multipliers instead of 0.16: scale the floats so that 1.0 maps to 0x8000, and add a couple of bit shift instructions to the scaleBytes function. You'll also need to clip the multipliers to an upper bound of 0x8000 after step 2; a single _mm256_min_epu16 instruction will do.
Update: I've just realized that for step 1 you don't need to scale; an offset alone is enough.
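One way the 1.15 variant might look, as a sketch rather than tuned code. scaleBytes15, scaleColor15 and the AVX2_FN macro are hypothetical names, and the placement of the two shifts is just one arrangement that keeps every intermediate value within 16 bits. The clip to 0x8000 is done inside the function here for simplicity.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical macro: lets GCC/Clang compile these functions without -mavx2
#if defined( __GNUC__ ) || defined( __clang__ )
#define AVX2_FN __attribute__( ( target( "avx2" ) ) )
#else
#define AVX2_FN
#endif

// Scales bytes by 1.15 fixed point multipliers, where 0x8000 stands for 1.0
AVX2_FN inline __m256i scaleBytes15( __m256i rgba, __m256i mul )
{
    const __m256i lowMask = _mm256_set1_epi16( 0xFF );
    const __m256i highMask = _mm256_set1_epi16( (short)0xFF00 );
    // Clip the multipliers to 1.0
    mul = _mm256_min_epu16( mul, _mm256_set1_epi16( (short)0x8000 ) );

    __m256i low = _mm256_and_si256( rgba, lowMask );
    __m256i high = _mm256_and_si256( rgba, highMask );
    // Low byte c: ( ( c << 1 ) * mul ) >> 16 equals ( c * mul ) >> 15
    low = _mm256_mulhi_epu16( _mm256_slli_epi16( low, 1 ), mul );
    // High byte c: ( ( c << 8 ) * mul ) >> 16 equals ( c * mul ) >> 8,
    // one more left shift puts ( c * mul ) >> 15 into bits 8..15
    high = _mm256_slli_epi16( _mm256_mulhi_epu16( high, mul ), 1 );
    high = _mm256_and_si256( high, highMask );
    return _mm256_or_si256( low, high );
}

// Hypothetical scalar wrapper: scales all 4 channels of one color by the same multiplier
AVX2_FN uint32_t scaleColor15( uint32_t rgba, uint16_t mul )
{
    const __m256i colors = _mm256_set1_epi32( (int)rgba );
    const __m256i muls = _mm256_set1_epi16( (short)mul );
    const __m256i scaled = scaleBytes15( colors, muls );
    return (uint32_t)_mm_cvtsi128_si32( _mm256_castsi256_si128( scaled ) );
}
```

Here a multiplier of 0x8000 leaves the color unchanged, e.g. scaleColor15( 0xFF804020, 0x8000 ) returns 0xFF804020, while 0x4000 halves each channel.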
// Test values
__m256 floats = _mm256_setr_ps( -1, 0, 0.11f, 0.33f, 0.99f, 1, 1.11f, 12 );
// Floats have 23 bits of mantissa.
// We want [0..1] to map to the least significant 15 of them.
// Therefore, we need to offset the floats by 2 ^ ( 23 - 15 ) = 2 ^ 8
constexpr float offsetFloat = 0x1p8f;
// Same value bit-cast to integer; too bad std::bit_cast only appeared in C++20
// https://www.h-schmidt.net/FloatConverter/IEEE754.html
constexpr int offsetInt = 0x43800000;
// Compute the integers
floats = _mm256_add_ps( floats, _mm256_set1_ps( offsetFloat ) );
const __m256i result = _mm256_sub_epi32( _mm256_castps_si256( floats ), _mm256_set1_epi32( offsetInt ) );
// Print the result
alignas( 32 ) std::array<int, 8> scalars;
_mm256_store_si256( ( __m256i * )scalars.data(), result );
for( int i : scalars )
    printf( "0x%04x ", i );
printf( "\n" );
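Putting the offset trick together with step 2 and the 0x8000 clip, building the multiplier vector could look roughly like the sketch below. makeMultipliers and the AVX2_FN macro are hypothetical. The final unpack is an assumption about the layout: one multiplier per pixel, copied into both 16-bit halves of that pixel. Inputs are assumed to be greater than -256 so the sum with the offset stays positive.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Hypothetical macro: lets GCC/Clang compile this function without -mavx2
#if defined( __GNUC__ ) || defined( __clang__ )
#define AVX2_FN __attribute__( ( target( "avx2" ) ) )
#else
#define AVX2_FN
#endif

// Converts 8 floats into 16 multipliers in 1.15 fixed point, two copies per input float
AVX2_FN std::array<uint16_t, 16> makeMultipliers( const float* src )
{
    // Offset by 2^8 so [0..1] lands in the lowest 15 bits of the mantissa, with 1.0 at 0x8000
    __m256 f = _mm256_loadu_ps( src );
    f = _mm256_add_ps( f, _mm256_set1_ps( 256.0f ) );
    // 0x43800000 is the bit pattern of 256.0f
    __m256i i32 = _mm256_sub_epi32( _mm256_castps_si256( f ), _mm256_set1_epi32( 0x43800000 ) );

    // Signed int32 to uint16 with saturation; negative values become 0.
    // Each 128-bit half now holds its 4 values twice.
    __m256i i16 = _mm256_packus_epi32( i32, i32 );
    // Clip to 1.0
    i16 = _mm256_min_epu16( i16, _mm256_set1_epi16( (short)0x8000 ) );
    // Assumed layout: interleave so the lanes read v0 v0 v1 v1 ... v7 v7
    i16 = _mm256_unpacklo_epi16( i16, i16 );

    std::array<uint16_t, 16> result;
    _mm256_storeu_si256( ( __m256i* )result.data(), i16 );
    return result;
}
```

In a real loop the vector would stay in a register and go straight into the scaling function; the std::array return type is only there to make the result easy to inspect.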
Upvotes: 3