Reputation: 983
I have a 32-bit RGBA image buffer. Let's assume it's, say, 1920x1080 -- a typical left-to-right, top-to-bottom raw buffer.
Here's what I'd like to do REALLY quickly: create two new buffers from this one source buffer...
My (slow) solution is as follows for each pixel of the input buffer:
u_int32_t pixel = *srcPtr++; // grab the source 32-bit pixel value
*fillPtr++ = pixel | 0xff; // FILL: keep only the RGB channels (alpha = 0xff)
pixel &= 0xff; // KEY: grab just the alpha value
*keyPtr++ = (pixel<<24) | (pixel<<16) | (pixel<<8) | 0xff; // KEY: xfer alpha to RGB, alpha = 0xff
One can assume that the source buffer is 16-byte aligned.
Some preliminary testing has this clocking in at about 8 ms on a 1920x1080 image -- Intel Xeon E5, hex-core, 12 MB L3 cache, 3.5 GHz.
Can someone offer their SSE3 intrinsics expertise to give this some speedup?
Upvotes: 2
Views: 510
Reputation: 33679
In addition to Cory's answer, you could try multiple threads. Even though this is memory bound, using multiple threads can increase the throughput on a single-socket system by up to a factor of two (and even more on a multi-socket system).
You could do something like this using OpenMP:
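#pragma omp parallel for
for(int i=0; i<height; i++) {
    for(int j=0; j<width; j+=4) {
        // src, fill and key are assumed to be 16-byte-aligned uint32_t buffers,
        // with width a multiple of 4
        split_pixels(_mm_load_si128((const __m128i*)&src[i*width+j]),
                     (__m128i*)&fill[i*width+j], (__m128i*)&key[i*width+j]);
    }
}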
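For a version that compiles on its own, here is a minimal sketch of the same row-parallel structure. The function name split_image_rows and its parameters are made up for illustration, and the inner loop uses the scalar per-pixel formula from the question (alpha in the low byte) as a stand-in for the SIMD kernel. Build with OpenMP enabled (for example -fopenmp on GCC or Clang); without it the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: splits src into fill and key, one image row per
   outer-loop iteration so OpenMP can hand rows out to threads. */
void split_image_rows(const uint32_t *src, uint32_t *fill, uint32_t *key,
                      int width, int height)
{
    assert(width >= 0 && height >= 0);

    #pragma omp parallel for
    for (int i = 0; i < height; i++) {
        const size_t row = (size_t)i * (size_t)width;
        for (int j = 0; j < width; j++) {
            const uint32_t pixel = src[row + j];
            const uint32_t a = pixel & 0xffu;   /* alpha in the low byte, as in the question */
            fill[row + j] = pixel | 0xffu;      /* FILL: RGB kept, alpha forced to 0xff */
            key[row + j] = (a << 24) | (a << 16) | (a << 8) | 0xffu;  /* KEY: alpha copied to RGB */
        }
    }
}
```

Each row writes to its own slice of fill and key, so the threads never touch the same output bytes and no locking is needed.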
Upvotes: 0
Reputation: 30031
It sounds like this is the basis of what you want -- it processes four pixels at once.
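#include <tmmintrin.h> // SSSE3, needed for _mm_shuffle_epi8

// Assumes alpha is the last byte of each pixel in memory (bytes 3, 7, 11, 15).
void split_pixels(__m128i src, __m128i *fill, __m128i *key)
{
    // _mm_set_epi8 lists bytes from 15 down to 0.
    __m128i const alphamask = _mm_set_epi8(-1, 0, 0, 0, -1, 0, 0, 0,
                                           -1, 0, 0, 0, -1, 0, 0, 0);
    // A control byte of -1 zeroes the output byte; the others select that pixel's alpha byte.
    __m128i const fillmask = _mm_set_epi8(-1, 15, 15, 15, -1, 11, 11, 11,
                                          -1, 7, 7, 7, -1, 3, 3, 3);
    _mm_stream_si128(fill, _mm_or_si128(src, alphamask)); // FILL: RGB kept, alpha = 0xff
    _mm_stream_si128(key, _mm_or_si128(_mm_shuffle_epi8(src, fillmask), alphamask)); // KEY
}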
It makes use of the byte-shuffle instruction (_mm_shuffle_epi8, i.e. pshufb), which shuffles bytes by their index in the register. Note that this instruction is part of SSSE3, not SSE3. It also uses streaming stores, because you won't be able to fit three 1080p buffers in cache. Streaming stores are finicky and may or may not help depending on what else you're doing, so I'd benchmark those.
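To make the shuffle concrete, the sketch below models the byte-shuffle semantics in plain C: each output byte is the source byte selected by the low four bits of the control byte, or zero when the control byte's high bit is set. The names shuffle_bytes and key_from_pixels are hypothetical, and the control table assumes alpha is the last byte of each pixel in memory (bytes 3, 7, 11, 15).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 16-byte shuffle (what _mm_shuffle_epi8 does per byte). */
void shuffle_bytes(const uint8_t src[16], const uint8_t ctrl[16], uint8_t dst[16])
{
    assert(src != dst);  /* this simple model does not support in-place use */
    for (int i = 0; i < 16; i++) {
        /* High bit set -> output zero; otherwise pick src[low 4 bits]. */
        dst[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0f];
    }
}

/* Hypothetical helper: builds the KEY bytes for four pixels by copying
   each pixel's alpha into its other three bytes and setting alpha to 0xff. */
void key_from_pixels(const uint8_t src[16], uint8_t key[16])
{
    /* Listed from byte 0 upward, i.e. the reverse of _mm_set_epi8's
       argument order. */
    static const uint8_t ctrl[16] = {  3,  3,  3, 0x80,   7,  7,  7, 0x80,
                                      11, 11, 11, 0x80,  15, 15, 15, 0x80 };
    shuffle_bytes(src, ctrl, key);
    for (int p = 0; p < 4; p++) {
        key[4 * p + 3] = 0xff;  /* same effect as OR-ing with the alpha mask */
    }
}
```

Writing the control table out byte by byte like this is a handy way to check a shuffle mask before committing it to an intrinsic call.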
Note that this problem is highly bottlenecked by memory bandwidth, so while it might run faster than your plain C version, it will probably not run 4x faster. The more processing you can bundle before the store, the faster it'll perform.
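As a rough back-of-the-envelope check on the bandwidth point, the sketch below uses only the figures from the question (1920x1080, 4 bytes per pixel, about 8 ms) and counts one buffer read plus two buffers written. The helper names are hypothetical, and the count ignores any extra traffic the cache hierarchy may add.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved by one pass: read the source once, write fill and key. */
uint64_t bytes_moved(uint64_t width, uint64_t height)
{
    const uint64_t buffer_bytes = width * height * 4;  /* 32-bit pixels */
    return 3 * buffer_bytes;
}

/* Effective throughput in GB/s, taking 1 GB as 1e9 bytes. */
double effective_gb_per_s(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1e9;
}
```

For the question's numbers, bytes_moved(1920, 1080) is 24,883,200 bytes per pass, and at 8 ms that works out to roughly 3.1 GB/s. Comparing that figure with the measured memory bandwidth of the target machine gives an idea of how much headroom is left.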
Upvotes: 2