Rafael Fontes

Reputation: 1353

How to copy bytes from memory using pattern (YUYV packed to YUV420 planar)

Let's start with this:

I have a 16-byte block of memory and I need to copy only the even-indexed bytes into an 8-byte block of memory.

My current algorithm is doing something like this:

unsigned int source_size = 16, destination_size = 8, i;

unsigned char * source = new unsigned char[source_size];
unsigned char * destination = new unsigned char[destination_size];

// fill source
for( i = 0; i < source_size; ++i)
{
    source[i] = 0xf + i;
}
// source :
// 0f 10 11 12  13 14 15 16  17 18 19 1a  1b 1c 1d 1e

// copy
for( i = 0; i < destination_size; ++i)
{
    destination[i] = source[i * 2];
}
// destination :
// 0f 11 13 15  17 19 1b 1d

This is just an example: I would like to know if there's a better method that also works when I need every 3rd or every 4th byte, not just the even ones.
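To make the "every 3rd or 4th byte" case concrete, the loop above generalizes to any step. This is only a minimal sketch of that baseline; `copy_strided` and its parameters are names invented for this illustration, and it is still an ordinary loop, not an optimized version.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: copies source[offset], source[offset + stride], ...
// into `count` consecutive destination bytes.
// The caller must guarantee offset + (count - 1) * stride < source length.
void copy_strided(const unsigned char* source, unsigned char* destination,
                  std::size_t count, std::size_t stride, std::size_t offset = 0)
{
    assert(stride > 0);
    for (std::size_t i = 0; i < count; ++i)
    {
        destination[i] = source[offset + i * stride];
    }
}
```

With the 16-byte source above, `copy_strided(source, destination, 4, 4)` gives 0f 13 17 1b, and a stride of 3 with a count of 5 gives 0f 12 15 18 1b.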

I know I can achieve this with a loop, but I need to optimize it. I don't really know how to use SSE, so I don't know whether it's applicable in this case, but some kind of memcpy-style magic would be great.

I also thought about using a macro to get rid of the loop since the size of the source and the destination are both constant, but that doesn't look like a big deal.

Maybe you can think outside the box if I say that this is for extracting the Y, Cb and Cr bytes from a YUYV pixel format. I should also emphasize that I'm doing this to get rid of libswscale.

Upvotes: 0

Views: 582

Answers (3)

stgatilov

Reputation: 5533

This problem can be solved efficiently with SSSE3:

#include <tmmintrin.h>  //SSSE3 and before
...
//source must be 16-byte aligned
unsigned char * source = (unsigned char *)_mm_malloc(source_size, 16);
//destination must be 8-byte aligned (that's natural anyway)
unsigned char * destination = (unsigned char *)_mm_malloc(destination_size, 8);
...
__m128i mask = _mm_set_epi8(                        //shuffling control mask (constant)
    -1, -1, -1, -1, -1, -1, -1, -1, 14, 12, 10, 8, 6, 4, 2, 0
);
__m128i reg = *(const __m128i*)source;              //load 16 bytes into a 128-bit register
__m128i comp = _mm_shuffle_epi8(reg, mask);         //do the bytes compaction
_mm_storel_epi64((__m128i*)destination, comp);      //store lower 64 bits

The conversion looks like this in the generated assembly (MSVC2013):

movdqa  xmm0, XMMWORD PTR [rsi]
pshufb  xmm0, XMMWORD PTR __xmm@ffffffffffffffff0e0c0a0806040200
movq    QWORD PTR [rax], xmm0

This method should be quite fast, especially when you do many such conversions. It costs only a single shuffle instruction (not counting the load and store), which seems to have a latency of 1 clock and a throughput of 0.5 clocks. Note that this approach can be used for other byte patterns too: only the shuffle mask changes.
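As a hedged illustration of using the same shuffle for another pattern, here is a sketch that pulls every 4th byte starting at offset 1 out of a 16-byte block (in YUYV those would be the U samples). The function name `extract_every_4th_from_1` is invented for this example. It uses `_mm_loadu_si128` so the source does not have to be aligned, and it checks `__SSSE3__`, which is a GCC/Clang convention (MSVC does not define it), with a scalar fallback for builds without SSSE3 enabled.

```cpp
#include <cassert>
#include <cstring>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#endif

// Hypothetical helper: writes source[1], source[5], source[9], source[13]
// to destination[0..3]. source must hold 16 readable bytes.
void extract_every_4th_from_1(const unsigned char* source, unsigned char* destination)
{
    assert(source != nullptr && destination != nullptr);
#if defined(__SSSE3__)
    // A mask byte of -1 (high bit set) zeroes the output byte;
    // any other value selects the source byte with that index.
    const __m128i mask = _mm_set_epi8(
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 13, 9, 5, 1);
    __m128i reg = _mm_loadu_si128((const __m128i*)source);  // unaligned 16-byte load
    __m128i comp = _mm_shuffle_epi8(reg, mask);              // gather the 4 bytes
    int low = _mm_cvtsi128_si32(comp);                       // lower 32 bits
    std::memcpy(destination, &low, sizeof low);              // store 4 bytes
#else
    // Scalar fallback producing the same result.
    for (int i = 0; i < 4; ++i)
        destination[i] = source[1 + 4 * i];
#endif
}
```

Changing the four low mask entries to 15, 11, 7, 3 would select offsets 3, 7, 11 and 15 instead (the V samples in YUYV).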

Upvotes: 2

rlb

Reputation: 1714

While I suspect the compiler and CPU will already be doing a great job for this case, if you really want alternatives, look into techniques for reversing Morton numbers. The question "How to de-interleave bits (UnMortonizing?)" shows how to do it on bits, but the idea can be extended to bytes too.

Something like this (example only, not production quality):

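// Assumes a little-endian CPU and an even destination_size; needs <cstdint> and <cstring>.
for (unsigned int i = 0; i < destination_size; i += 2) {
   uint32_t s;
   std::memcpy(&s, &source[i * 2], sizeof s);   // read 4 source bytes: b0 b1 b2 b3
   s &= 0x00ff00ffu;                            // keep b0 and b2, drop b1 and b3
   uint16_t d = (uint16_t)(s | (s >> 8));       // low 16 bits now hold b0 b2
   std::memcpy(&destination[i], &d, sizeof d);  // write 2 destination bytes
}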

Whether this is faster than your version depends on the exact CPU type and on what the compiler generates, so test and see which is quicker. As others have mentioned, for an array as small as the one given, the memory fetch bottleneck will overshadow everything.

Upvotes: 2

werewindle

Reputation: 3029

Unfortunately, you can't do this with memcpy() tricks alone. Modern processors have 64-bit registers, and that is the optimal size for memory transfers. Modern compilers always try to optimize memcpy() calls into 64-bit (or 32-bit, or even 128-bit) transfers.

But in your case you need 'strange' 24- or 16-bit transfers. That is exactly why we have SSE, NEON and other processor extensions, and why they are so widely used in video processing.

So in your case, you should use one of the SSE-optimized libraries or write your own assembler code to do these memory transfers.
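Whichever route you take, a plain scalar reference is handy for checking the optimized output against. Below is a minimal sketch of the YUYV packed to YUV420 planar conversion from the question title, under assumptions that are mine and not from the question: `yuyv_to_yuv420p` is a made-up name, width and height are even, rows are tightly packed with no padding, and chroma is simply taken from the even rows rather than averaged over two rows.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference converter. YUYV stores two pixels as Y0 U Y1 V.
// y_plane needs width*height bytes; u_plane and v_plane need
// (width/2)*(height/2) bytes each.
void yuyv_to_yuv420p(const unsigned char* yuyv, std::size_t width, std::size_t height,
                     unsigned char* y_plane, unsigned char* u_plane, unsigned char* v_plane)
{
    assert(width % 2 == 0 && height % 2 == 0);
    for (std::size_t row = 0; row < height; ++row)
    {
        const unsigned char* src = yuyv + row * width * 2;  // 2 bytes per pixel
        unsigned char* y = y_plane + row * width;
        for (std::size_t x = 0; x < width; ++x)
            y[x] = src[2 * x];                              // every even byte is luma
        if (row % 2 == 0)                                   // chroma from even rows only
        {
            unsigned char* u = u_plane + (row / 2) * (width / 2);
            unsigned char* v = v_plane + (row / 2) * (width / 2);
            for (std::size_t x = 0; x < width / 2; ++x)
            {
                u[x] = src[4 * x + 1];                      // every 4th byte from offset 1
                v[x] = src[4 * x + 3];                      // every 4th byte from offset 3
            }
        }
    }
}
```

The three inner loops are exactly the "every 2nd byte" and "every 4th byte" patterns from the question, so each is a candidate for replacement with a SIMD shuffle once the reference output is known to be right.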

Upvotes: 1
