Reputation: 803

Optimize a bit decoding operation in C

I have an unsigned 32 bit integer encoded in the following way:

the first 6 bits define the opcode
next 8 bits define a register
next 18 bits are a two's complement signed integer value.

I am currently decoding this number (uint32_t inst) using:

const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const int32_t value = ((inst >> 17) & 0x01) ? -(131072 - (inst & 0x1FFFF)) : (inst & 0x1FFFF);

I can measure a significant overhead while decoding the value and I am quite sure it is due to the ternary operator (essentially an if statment) used to compare the sign an perform the negative operation.

Is there a way to perform value decoding in a faster way?

Upvotes: 3

Answers (5)

Peter Cordes

Reputation: 364118

For ideal compiler output with no implementation-defined or undefined behaviour, use @doynax's 2's complement decoding expression:

value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;

The casts make sure we're doing signed subtraction, rather than unsigned with wraparound and then assigning that bit-pattern to a signed integer.

This compiles to optimal asm on ARM, where gcc uses sbfx r1, r1, #0, #18 (signed bitfield-extract) to sign-extend bits [17:0] into a full int32_t register. On x86, it uses shl by 14 and sar by 14 (arithmetic shift) to do the same thing. This is a clear sign that gcc recognizes the 2's complement pattern and uses whatever is most optimal on the target machine to sign-extend the bitfield.

There isn't a portable way to make sure bitfields are ordered the way you want them. gcc appears to order bitfields from LSB to MSB for little-endian targets, but MSB to LSB for big-endian targets. You can use an #if to get identical asm output for ARM with/without -mbig-endian, just like the other methods, but there's no guarantee that other compilers work the same.

If gcc/clang didn't see through the xor and sub, it would be worth considering the <<14 / >>14 implementation which hand-holds the compiler towards doing it that way. Or considering the signed/unsigned bitfield approach with an #if.

But since we can get ideal asm from gcc/clang with fully safe and portable code, we should just do that.

See the code on the Godbolt Compiler Explorer, for versions from most of the answers. You can look at asm output for x86, ARM, ARM64, or PowerPC.

// have to put the results somewhere, so the function doesn't optimize away
struct decode {
  //unsigned char opcode, r1;
  unsigned int opcode, r1;
  int32_t value;
};
// in real code you might return the struct by value, but there's less ABI variation when looking at the ASM this way (some would pack the struct into registers)

void decode_two_comp_doynax(struct decode *result, uint32_t inst) {
  result->opcode = ((inst >> 26) & 0x3F);
  result->r1 = (inst >> 18) & 0xFF;
  result->value = ((inst & 0x3FFFF) ^ 0x20000) - 0x20000;
}

# clang 3.7.1 -O3 -march=haswell   (enables BMI1 bextr)
    mov     eax, esi
    shr     eax, 26                     # grab the top 6 bits with a shift
    mov     dword ptr [rdi], eax
    mov     eax, 2066          # (0x812)# only AMD provides bextr r32, r32, imm.  Intel has to set up the constant separately
    bextr   eax, esi, eax               # extract the middle bitfield
    mov     dword ptr [rdi + 4], eax
    shl     esi, 14                     # <<14
    sar     esi, 14                     # >>14 (arithmetic shift)
    mov     dword ptr [rdi + 8], esi
    ret

Upvotes: 2

John Bollinger

Reputation: 180141

Your expression is more complicated than it needs to be, especially in needlessly involving the ternary operator. The following expression computes the same results for all inputs without involving the ternary operator.^* It is a good candidate for a replacement, but as with any optimization problem, it is essential to test:

const int32_t value = (int32_t)(inst & 0x1FFFF) - (int32_t)(inst & 0x20000);

Or this variation on @doynax's suggestion along similar lines might be more optimizer-friendly:

const int32_t value = (int32_t)(inst & 0x3FFFF ^ 0x20000) - (int32_t)0x20000;

In each case, the casts avoid implementation-defined behavior; on many architectures they would be no-ops as far as the machine code is concerned. On those architectures, these expressions involve fewer operations in all cases than does yours, not to mention being unconditional.

Competitive alternatives involving shifting may also optimize well, but all such alternatives necessarily rely on implementation-defined behavior because of integer overflow of a left shift, a negative integer being the left-hand operand of a right shift, and / or converting an out-of-range value to a signed integer type. You will have to determine for yourself whether constitutes a problem.

^* as compiled by GCC 4.4.7 for x86_64. The original expression invokes implementation-defined behavior for some inputs, so on other implementations the two expressions might compute different values for those inputs.

Upvotes: 6

jxh

Reputation: 70392

You may consider using bit-fields to simplify your code.

typedef struct inst_type {
#ifdef MY_MACHINE_NEEDS_THIS
    uint32_t opcode :  6;
    uint32_t r1     :  8;
    int32_t  value  : 18;
#else
    int32_t  value  : 18;
    uint32_t r1     :  8;
    uint32_t opcode :  6;
#endif
} inst_type;

const uint32_t opcode = inst.opcode;
const uint32_t r1 = inst.r1;
const int32_t value = inst.value;

Direct bit manipulation often performs better, but not always. Using John Bollinger's answer as a baseline, the above structure results in one fewer instruction to extract the three values of interest on GCC (but fewer instructions does not necessarily mean faster).

Upvotes: 1

anatolyg

Reputation: 28241

A standard (even though non-portable) practice is a left-shift followed by an arithmetic right-shift:

const int32_t temp = inst << 14; // "shift out" the 14 unneeded bits
const int32_t value = temp >> 14; // shift the number back; sign-extend

This involves a conversion from uint32_t to int32_t and a right-shift of a possibly negative int32_t; both operations are implementation-defined, i.e. unportable (work on 2's complement systems; pretty much guaranteed to work on any architecture). If you want to gain the best performance and willing to rely on implementation-defined behavior, you can use this code.

As a single expression:

const int32_t value = (int32_t)(inst << 14) >> 14;

Note: the following looks cleaner, will also typically work, but involves undefined behavior (signed integer overflow):

const int32_t value = (int32_t)inst << 14 >> 14;

Don't use it! (even though you probably won't receive any warning or error about it).

Upvotes: 3

Teivaz

Reputation: 5665

const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const uint32_t negative = ((inst >> 17) & 0x01);
const int32_t value =  -(negative * 131072 - (inst & 0x1FFFF));

when negative is 1 -(131072 - (inst & 0x1FFFF)) and for 0: -(0 - (inst & 0x1FFFF)) which is equal to inst & 0x1FFFF.

Upvotes: 0

Optimize a bit decoding operation in C

Answers (5)

Related Questions