zzyzy

Reputation: 983

Intel intrinsics needed for swizzling 32-bit alpha channel

I have a 32-bit RGBA image buffer. Let's assume it's, say, 1920x1080: a typical left-to-right, top-to-bottom raw buffer.

Here's what I'd like to do REALLY quickly: create two new buffers from this one source buffer...

  1. "FILL" Buffer... the RGB values match that of the original buffer. The alpha value would become opaque (0xff)
  2. "KEY" Buffer... each of the RGB values match the alpha value of the original buffer. The alpha value would be opaque (0xff)

My (slow) solution is as follows for each pixel of the input buffer:

u_int32_t pixel = *srcPtr++;  // grab the source 32-bit pixel value
*fillPtr++ = pixel | 0xff;  // FILL: keep only the RGB channels (alpha = 0xff)
pixel &= 0xff;              // KEY: grab just the alpha value
*keyPtr++ = (pixel<<24) | (pixel<<16) | (pixel<<8) | 0xff; // KEY: xfer alpha to RGB, alpha = 0xff

One can assume that the source buffer is 16-byte aligned.

Some preliminary testing has this clocking in at about 8 ms on a 1920x1080 image -- Intel Xeon E5, hex-core, 12 MB L3 cache, 3.5 GHz.

Can someone offer their SSE3 intrinsics expertise to give this some speedup?

Upvotes: 2

Views: 510

Answers (2)

Z boson

Reputation: 33679

In addition to Cory's answer, you could try multiple threads. Even though this is memory bound, using multiple threads can increase throughput on a single-socket system by up to a factor of two (and even more on a multi-socket system).

You could do something like this using OpenMP:

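#pragma omp parallel for
for(int i=0; i<height; i++) {
    for(int j=0; j<width; j+=4) {
        // Assumes src, fill and key are 16-byte-aligned 32-bit pixel pointers
        // and width is a multiple of 4. split_pixels takes the pixels by value.
        __m128i px = _mm_load_si128((__m128i const*)&src[i*width+j]);
        split_pixels(px, (__m128i*)&fill[i*width+j], (__m128i*)&key[i*width+j]);
    }
}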
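
For reference, here is a fuller sketch of the same row-parallel idea as a self-contained function. The names split_row and split_image are made up for illustration, and it assumes alpha is the top byte of each pixel (0xAARRGGBB), which is what the masks in Cory's answer imply. It uses unaligned loads and plain stores so it works for any width, and falls back to scalar code when SSSE3 is not enabled at compile time. Treat it as a starting point to benchmark, not a tuned implementation.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* Hypothetical helper: splits one row of `width` pixels.
   Assumption: alpha is the top byte of each pixel (0xAARRGGBB). */
static void split_row(const uint32_t *src, uint32_t *fill, uint32_t *key, size_t width)
{
    size_t j = 0;
#if defined(__SSSE3__)
    const __m128i alphamask = _mm_set1_epi32((int)0xFF000000u);
    /* -1 zeroes the output byte; other values are source byte indices. */
    const __m128i fillmask = _mm_set_epi8(-1, 15, 15, 15, -1, 11, 11, 11,
                                          -1, 7, 7, 7, -1, 3, 3, 3);
    for (; j + 4 <= width; j += 4) {
        __m128i px = _mm_loadu_si128((const __m128i *)(src + j));
        _mm_storeu_si128((__m128i *)(fill + j), _mm_or_si128(px, alphamask));
        _mm_storeu_si128((__m128i *)(key + j),
                         _mm_or_si128(_mm_shuffle_epi8(px, fillmask), alphamask));
    }
#endif
    /* Scalar tail, or the whole row when SSSE3 is not available. */
    for (; j < width; j++) {
        uint32_t a = src[j] >> 24;
        fill[j] = src[j] | 0xFF000000u;
        key[j]  = 0xFF000000u | (a << 16) | (a << 8) | a;
    }
}

/* Hypothetical driver: one row per loop iteration, rows shared among threads.
   Without -fopenmp the pragma is ignored and the loop runs serially. */
void split_image(const uint32_t *src, uint32_t *fill, uint32_t *key,
                 int width, int height)
{
    #pragma omp parallel for
    for (int i = 0; i < height; i++) {
        size_t off = (size_t)i * (size_t)width;
        split_row(src + off, fill + off, key + off, (size_t)width);
    }
}
```

Build with something like -O2 -mssse3 -fopenmp on GCC or Clang to get both the vector path and the threading. Splitting by rows keeps each thread on its own contiguous block of memory.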

Upvotes: 0

Cory Nelson

Reputation: 30031

It sounds like this is the basis of what you want -- it processes four pixels at once.

void split_pixels(__m128i src, __m128i *fill, __m128i *key)
{
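    // _mm_set_epi8 lists bytes from 15 down to 0. Alpha is taken to be the
    // top byte of each pixel (bytes 3, 7, 11 and 15).
    __m128i const alphamask = _mm_set_epi8(-1, 0, 0, 0, -1, 0, 0, 0,
                                           -1, 0, 0, 0, -1, 0, 0, 0);
    // -1 zeroes the output byte; any other value is a source byte index.
    __m128i const fillmask = _mm_set_epi8(-1, 15, 15, 15, -1, 11, 11, 11,
                                          -1, 7, 7, 7, -1, 3, 3, 3);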

    _mm_stream_si128(fill, _mm_or_si128(src, alphamask));
    _mm_stream_si128(key, _mm_or_si128(_mm_shuffle_epi8(src, fillmask), alphamask));
}

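It makes use of the SSSE3 byte shuffle (pshufb, exposed as _mm_shuffle_epi8), which builds each output byte from the source byte whose index is given in the mask, so it needs SSSE3 rather than just SSE3. The masks assume alpha is the top byte of each 32-bit pixel; your scalar code treats the low byte as alpha, so adjust the indices if that is your real layout. It also uses streaming stores, because you won't be able to fit three 1080p buffers in cache. Streaming stores are finicky and may or may not help depending on what else you're doing, so I'd benchmark those.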
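
If the shuffle mask is hard to picture, a scalar model may help. The sketch below is not the intrinsic itself, just plain C that mimics what pshufb does per byte: an output byte is zero when the top bit of its mask byte is set, otherwise it is the source byte at the index in the low four bits of the mask. The function names are made up for illustration, and alpha is assumed to sit in bytes 3, 7, 11 and 15.

```c
#include <stdint.h>

/* Scalar model of the SSSE3 byte shuffle (pshufb / _mm_shuffle_epi8):
   out[i] is 0 when bit 7 of mask[i] is set, else src[mask[i] & 0x0F]. */
void shuffle_bytes_model(const uint8_t src[16], const uint8_t mask[16],
                         uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Builds the KEY bytes for four pixels.
   Assumption: alpha is the top byte of each pixel (bytes 3, 7, 11, 15). */
void key_from_pixels_model(const uint8_t src[16], uint8_t key[16])
{
    /* Written lowest byte first, the reverse of _mm_set_epi8's argument order. */
    static const uint8_t fillmask[16] = {
        3, 3, 3, 0x80,   7, 7, 7, 0x80,   11, 11, 11, 0x80,   15, 15, 15, 0x80
    };
    shuffle_bytes_model(src, fillmask, key);
    for (int i = 3; i < 16; i += 4)
        key[i] = 0xFF;  /* same effect as OR-ing with alphamask */
}
```

Each pixel's alpha index is repeated three times to fill the three colour bytes, and the fourth byte is zeroed so the OR can set it to 0xFF.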

Note that this problem is highly bottlenecked by memory bandwidth, so while it might run faster than your plain C version, it will probably not run 4x faster. The more processing you can bundle before the store, the faster it'll perform.
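
To put a rough number on the memory traffic, you can work out how many bytes each frame moves. This is a back-of-the-envelope sketch with made-up function names; it counts one 4-byte read and two 4-byte writes per pixel and ignores any extra traffic that ordinary (non-streaming) stores may cause by reading the destination cache lines first.

```c
/* Bytes moved per frame: one source read plus two destination writes,
   4 bytes per pixel each. */
double bytes_per_frame(int width, int height)
{
    return 3.0 * 4.0 * (double)width * (double)height;
}

/* Effective throughput in GB/s (10^9 bytes per second) for a measured time. */
double throughput_gb_per_s(int width, int height, double seconds)
{
    return bytes_per_frame(width, height) / seconds / 1e9;
}
```

For 1920x1080 that is 24,883,200 bytes per frame, so the 8 ms you measured works out to about 3.1 GB/s. Comparing that figure with the bandwidth you actually measure on your machine tells you how much headroom is left.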

Upvotes: 2
