Reputation: 91
I have to combine two 16-bit words into one 32-bit word several hundred times, which takes a lot of computation power. I would like to find a more efficient way to do this.
I have two 16-bit words named A and B, and I want a 32-bit word named C. The bits of A should be copied to the even-numbered bits of C, and the bits of B to the odd-numbered bits of C. For example, with A: 0b0000000000000000 and B: 0b1111111111111111, the resulting C should be 0b10101010101010101010101010101010.
My current solution looks like this:
for (i = 0; i < 32; i += 2)
{
    C |= (A & (1 << (i/2))) << (i/2);
    C |= (B & (1 << (i/2))) << (i/2 + 1);
}
This solution takes too much time when I have several hundred values of C to compute. I am looking for a better one!
Added: This program runs on TriCore. I have no choice but to process the data this way, because the relation between A, B and C is defined by the protocol.
Thank you!
Upvotes: 5
Views: 2473
Reputation: 93564
The following uses two walking-one masks: one for testing the source data bits and one for masking into the destination. I tested at compileonline.com for 10 million iterations to get a baseline time - though don't stop reading; there are dramatic improvements to follow.
uint32_t C ;
uint16_t srcmask ;
uint32_t dstmask ;

for( C = 0, srcmask = 1u, dstmask = 1u;
     srcmask != 0;
     srcmask <<= 1 )
{
    if( (A & srcmask) != 0 )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    if( (B & srcmask) != 0 )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
}
In theory the performance may vary with the number of 1 bits in the source data, but in my test any difference was not measurable; a different target and compiler may of course yield different results.
Unrolling the loop to process 4 source bits per iteration gave a marginal benefit (0.77 seconds):
for( C = 0, srcmask = 1u, dstmask = 1u;
     srcmask != 0;
     srcmask <<= 1 )
{
    // Unroll 1
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    // Unroll 2
    srcmask <<= 1 ;
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    // Unroll 3
    srcmask <<= 1 ;
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    // Unroll 4
    srcmask <<= 1 ;
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
}
Further unrolling had a detrimental effect, but again target and compiler results may vary.
I then declared C, srcmask and dstmask as register variables, without expecting any difference:
register uint32_t C ;
register uint16_t srcmask ;
register uint32_t dstmask ;
I was astounded at the result:
The effect of the unrolling was significant here - without it the time rose to 0.45 seconds, and with 2x unrolling it was 0.33 seconds. Further unrolling had minimal effect. Declaring A and B as register reduced performance slightly - there are only so many registers to go around! Again, YMMV.
The conclusion must be therefore that you need to experiment with a number of techniques to determine what works best on your target. Here a combination of better algorithm, loop-unrolling and register variables had a dramatic impact. Experimentation with different compiler optimisation settings may also have an impact, though what improves one area of code may be to the detriment of others, so you may not want to apply the same optimisations to all code.
Upvotes: 0
Reputation: 20037
This problem is also called 'Morton number encoding'; i.e. flattening 2-D or 3-D coordinates to a single number.
This blog entry summarizes three typical methods: the naïve for loop, magic bits (as in chux's answer) and a look-up table (LUT). The LUT-based approach was the clear winner.
One basically has to choose how many bits to process at a time. Typically the sweet spot is an 8->16-bit or 4->8-bit LUT, such as the 4->8-bit one below:
0001 --> 0 0 0 0 0 0 0 1
0010 --> 0 0 0 0 0 1 0 0
0011 --> 0 0 0 0 0 1 0 1 etc.
Expanding two uint8_t variables with this table is achieved with the formula:
uint16_t ans = LUT[a & 15] + (LUT[b & 15] << 1) +
               (LUT[a >> 4] << 8) + (LUT[b >> 4] << 9);
Again, one has to profile whether, for the given number of bits, it is more efficient to have 4 distinct tables, each pre-shifted left by a constant, or to perform the shifts manually.
Upvotes: 0
Reputation: 154075
Rather than a loop, shift in groups.
Some further simplifications possible, but below is the gist of it. Is it faster on average (or worst-case)? Profile to find out.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
uint64_t Merge(uint32_t a, uint32_t b) {
    uint64_t A, B;
    A = ((a & 0x00000000FFFF0000ull) << 16) | (a & 0x000000000000FFFFull);
    A = ((A & 0x0000FF000000FF00ull) <<  8) | (A & 0x000000FF000000FFull);
    A = ((A & 0xF0F0F0F0F0F0F0F0ull) <<  4) | (A & 0x0F0F0F0F0F0F0F0Full);
    A = ((A & 0xCCCCCCCCCCCCCCCCull) <<  2) | (A & 0x0333333333333333ull);
    A = ((A & 0xAAAAAAAAAAAAAAAAull) <<  1) | (A & 0x5555555555555555ull);
    B = ((b & 0x00000000FFFF0000ull) << 16) | (b & 0x000000000000FFFFull);
    B = ((B & 0x0000FF000000FF00ull) <<  8) | (B & 0x000000FF000000FFull);
    B = ((B & 0xF0F0F0F0F0F0F0F0ull) <<  4) | (B & 0x0F0F0F0F0F0F0F0Full);
    B = ((B & 0xCCCCCCCCCCCCCCCCull) <<  2) | (B & 0x0333333333333333ull);
    B = ((B & 0xAAAAAAAAAAAAAAAAull) <<  1) | (B & 0x5555555555555555ull);
    return A | (B << 1);
}
void MergeTest(uint32_t a, uint32_t b) {
    uint64_t C = Merge(a, b);
    printf("a:%08" PRIX32 " b:%08" PRIX32 " c:%016" PRIX64 "\n", a, b, C);
}

void MergeTests(void) {
    MergeTest(0x00000000L, 0xFFFFFFFFL);
    MergeTest(0xFFFFFFFFL, 0x00000000L);
    MergeTest(0x00000000L, 0x00000001L);
    MergeTest(0x00000000L, 0x00000010L);
}
a:00000000 b:FFFFFFFF c:AAAAAAAAAAAAAAAA
a:FFFFFFFF b:00000000 c:5555555555555555
a:00000000 b:00000001 c:0000000000000002
a:00000000 b:00000010 c:0000000000000200
Upvotes: 1
Reputation: 41
Seems to be about 40% faster, but it really depends on compiler optimizations ;-) Note that i and j must be wide enough (e.g. uint64_t) for the termination condition i < 0x100000000 to work:

uint64_t i, j;
uint32_t msk;

for (i = 1, j = 2, msk = 1; i < 0x100000000ull; i <<= 2, j <<= 2, msk <<= 1) {
    if (A & msk) C |= i;
    if (B & msk) C |= j;
}
Upvotes: 0
Reputation: 126418
Turns out TriCore has a BMERGE instruction that does precisely what you want -- it takes two 16-bit values and interleaves the bits. If you're using the gcc-based toolchain, you should be able to use a single inline asm statement -- something like:

asm("bmerge %0,%1,%2" : "=r"(C) : "r"(A), "r"(B));

There's also a BSPLIT instruction that does the reverse.
Upvotes: 6
Reputation: 93880
The most likely type of solution to work on an MCU (which might be 8-bit and probably doesn't have a barrel shifter) is hand-coded assembly along these lines (taking A, B, and CL/CH as 16-bit registers):
        MOV  CNT, 16
LOOP:
        RRC  A        ; rotate A right through the carry
        RRC  CH       ; carry enters C at the top
        RRC  CL       ; continue the roll through CL
        RRC  B
        RRC  CH
        RRC  CL
        DJNZ CNT, LOOP
(Obviously each RRC becomes two if the MCU is 8-bit.)
This solution "shuffles" the bits together while only rotating one bit per cycle, which any MCU can do. You can try to write this in C, but you'll need a very good optimizer to produce this sequence of instructions from something like lsb = A & 1; A >>= 1; C >>= 1; C |= lsb << 31;
EDIT: With a 32-bit CPU you could consider all of the options listed at Bit Twiddling Hacks.
Upvotes: 0
Reputation: 50831
Try this:

for (i = 0; i < 32; i += 2)
{
    int i2 = i >> 1;
    int andval = 1 << i2;
    C |= (A & andval) << i2;
    C |= (B & andval) << (i2 + 1);
}
but your compiler has probably done this optimisation already.
Upvotes: 0