Reputation: 91
I have to combine two 16-bit words into one 32-bit word several hundred times, which takes a lot of computation power. I would like to find a more efficient way to do this.
I have two 16-bit words named A and B, and I want a 32-bit word named C. The bits of A should be copied to the even-numbered bits of C, and the bits of B to the odd-numbered bits of C. For example, with A: 0b0000000000000000 and B: 0b1111111111111111, the resulting C should be 0b10101010101010101010101010101010.
My current solution looks like this:
for (i = 0; i < 32; i += 2)
{
    C |= (A & (1 << (i/2))) << (i/2);
    C |= (B & (1 << (i/2))) << (i/2 + 1);
}
This solution takes too much time when I have several hundred values of C to compute. I am looking for a better one!
Added: This program runs on TriCore. I have no choice but to process the data this way, because the relation between A, B and C is defined by the protocol.
Thank you!
Upvotes: 5
Views: 2473
Reputation: 93564
The following uses two walking-one masks: one for testing the source data bits and one for masking into the destination. I tested at compileonline.com for 10 million iterations to get a baseline time - though don't stop reading; there are dramatic improvements to follow.
uint32_t C ;
uint16_t srcmask ;
uint32_t dstmask ;

for( C = 0, srcmask = 1u, dstmask = 1u;
     srcmask != 0;
     srcmask <<= 1 )
{
    if( (A & srcmask) != 0 )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    if( (B & srcmask) != 0 )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
}
In theory the performance may vary with the number of 1 bits in the source data, but in my test any difference was not measurable; a different target and compiler may of course yield different results.
Unrolling the loop to process 4 source bits per iteration gave a marginal benefit (0.77 seconds):
for( C = 0, srcmask = 1u, dstmask = 1u;
     srcmask != 0;
     srcmask <<= 1 )
{
    // Unroll 1
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    // Unroll 2
    srcmask <<= 1 ;
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    // Unroll 3
    srcmask <<= 1 ;
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;

    // Unroll 4
    srcmask <<= 1 ;
    if( (A & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
    if( (B & srcmask) )
    {
        C |= dstmask ;
    }
    dstmask <<= 1 ;
}
Further unrolling had a detrimental effect, but again target and compiler results may vary.
I then declared C, srcmask and dstmask as register variables, without expecting any difference:
register uint32_t C ;
register uint16_t srcmask ;
register uint32_t dstmask ;
I was astounded at the result:
The effect of the unrolling was significant here - without it the time rose to 0.45 seconds, and with 2x unrolling it was 0.33 seconds. Further unrolling had minimal effect. Declaring A and B as register reduced performance slightly - there are only so many registers to go around! Again, YMMV.
The conclusion must be therefore that you need to experiment with a number of techniques to determine what works best on your target. Here a combination of better algorithm, loop-unrolling and register variables had a dramatic impact. Experimentation with different compiler optimisation settings may also have an impact, though what improves one area of code may be to the detriment of others, so you may not want to apply the same optimisations to all code.
Upvotes: 0
Reputation: 20037
This problem is also called 'Morton number encoding'; i.e. flattening 2-D or 3-D coordinates to a single number.
This blog entry summarizes three typical methods: the naïve for loop, magic bits (as in chux's answer) and a look-up table (LUT). The LUT-based approach was the clear winner.
One basically has to choose how many bits to process at a time. Typically the sweet spot is an 8->16-bit or 4->8-bit LUT, such as the 4->8-bit one below:
0001 --> 0 0 0 0 0 0 0 1
0010 --> 0 0 0 0 0 1 0 0
0011 --> 0 0 0 0 0 1 0 1 etc.
Expanding two uint8_t variables with this table is achieved with the formula:
uint16_t ans = LUT[a & 15] + (LUT[b & 15] << 1) +
               (LUT[a >> 4] << 8) + (LUT[b >> 4] << 9);
Again, one has to profile whether, for the given number of bits, it is more efficient to have 4 distinct tables, each pre-shifted left by a constant, or to perform the shifts manually.
Upvotes: 0
Reputation: 154075
Rather than a loop, shift in groups.
Some further simplifications possible, but below is the gist of it. Is it faster on average (or worst-case)? Profile to find out.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
uint64_t Merge(uint32_t a, uint32_t b) {
    uint64_t A, B;
    A = ((a & 0x00000000FFFF0000ull) << 16) | (a & 0x000000000000FFFFull);
    A = ((A & 0x0000FF000000FF00ull) <<  8) | (A & 0x000000FF000000FFull);
    A = ((A & 0xF0F0F0F0F0F0F0F0ull) <<  4) | (A & 0x0F0F0F0F0F0F0F0Full);
    A = ((A & 0xCCCCCCCCCCCCCCCCull) <<  2) | (A & 0x0333333333333333ull);
    A = ((A & 0xAAAAAAAAAAAAAAAAull) <<  1) | (A & 0x5555555555555555ull);
    B = ((b & 0x00000000FFFF0000ull) << 16) | (b & 0x000000000000FFFFull);
    B = ((B & 0x0000FF000000FF00ull) <<  8) | (B & 0x000000FF000000FFull);
    B = ((B & 0xF0F0F0F0F0F0F0F0ull) <<  4) | (B & 0x0F0F0F0F0F0F0F0Full);
    B = ((B & 0xCCCCCCCCCCCCCCCCull) <<  2) | (B & 0x0333333333333333ull);
    B = ((B & 0xAAAAAAAAAAAAAAAAull) <<  1) | (B & 0x5555555555555555ull);
    return A | (B << 1);
}
void MergeTest(uint32_t a, uint32_t b) {
    uint64_t C = Merge(a, b);
    printf("a:%08" PRIX32 " b:%08" PRIX32 " c:%016" PRIX64 "\n", a, b, C);
}

void MergeTests(void) {
    MergeTest(0x00000000L, 0xFFFFFFFFL);
    MergeTest(0xFFFFFFFFL, 0x00000000L);
    MergeTest(0x00000000L, 0x00000001L);
    MergeTest(0x00000000L, 0x00000010L);
}
a:00000000 b:FFFFFFFF c:AAAAAAAAAAAAAAAA
a:FFFFFFFF b:00000000 c:5555555555555555
a:00000000 b:00000001 c:0000000000000002
a:00000000 b:00000010 c:0000000000000200
Upvotes: 1
Reputation: 41
Seems to be about 40% faster, but it really depends on compiler optimizations ;-) Note that i and j must be wide enough (e.g. uint64_t) for the termination condition i < 0x100000000 to work:

uint64_t i, j;
uint32_t msk;

for (i = 1, j = 2, msk = 1; i < 0x100000000ull; i <<= 2, j <<= 2, msk <<= 1) {
    if (A & msk) C |= i;
    if (B & msk) C |= j;
}
Upvotes: 0
Reputation: 126418
Turns out TriCore has a BMERGE instruction that does precisely what you want -- it takes two 16-bit values and interleaves the bits. If you're using the gcc-based toolchain, you should be able to use a single inline asm statement -- something like:

asm("bmerge %0,%1,%2" : "=r"(C) : "r"(A), "r"(B));

There's also a BSPLIT instruction that does the reverse.
Upvotes: 6
Reputation: 93880
The most likely type of solution to work on an MCU (which might be 8-bit and probably doesn't have a barrel shifter) is hand-coded assembly along these lines (taking A, B, and CL/CH as 16-bit registers):
        MOV  CNT, 16
LOOP:
        RRC  A        ; rotate A right through the carry
        RRC  CH       ; carry enters C at the top
        RRC  CL       ; continue the roll through CL
        RRC  B
        RRC  CH
        RRC  CL
        DJNZ CNT, LOOP
(Obviously each RRC becomes two if the MCU is 8-bit.)
This solution "shuffles" the bits together while only rotating one bit per cycle, which any MCU can do. You can try to write this in C, but you'll need a very good optimizer to produce this sequence of instructions from something like lsb = A & 1; A >>= 1; C >>= 1; C |= lsb << 31;
EDIT: With a 32-bit CPU you could consider all of the options listed at Bit Twiddling Hacks.
Upvotes: 0
Reputation: 50831
Try this:

for (i = 0; i < 32; i += 2)
{
    int i2 = i >> 1;
    int andval = 1 << i2;
    C |= (A & andval) << i2;
    C |= (B & andval) << (i2 + 1);
}
but your compiler has probably done this optimisation already.
Upvotes: 0