Larry

Reputation: 1765

bitwise - how are bitmask operations implemented?

Context

I am using a lot of bitwise operations, but I don't even know how they are implemented at the lowest possible level.

I would like to see how the Intel/AMD devs manage to implement such operations. Not to replace them in my code, that would be dumb... but to get a better grasp of what is going on.

I tried to find some info, but most of the time people ask about how to use these operations or how to replace them with other bitwise operations, which is not the case here.

Questions

Is the CPU iterating over the 32 bits in assembly (SSE) and comparing them one by one?

Are there tricks to get it up to speed?

Thanks

Upvotes: 3

Views: 997

Answers (4)

mrjoltcola

Reputation: 20852

Almost all are implemented directly on the CPU as basic, native instructions, not as part of SSE. These are the oldest, most basic operations the CPU performs on a register.

As to how AND, OR, XOR, etc. are implemented, if you are really interested, look up digital logic design or discrete math. Look up flip-flops, AND gates, and NAND / NOR / XOR gates:

https://en.wikipedia.org/wiki/NAND_logic

Also look up K-maps (Karnaugh maps); these are what you can use to implement a logic circuit by hand:

https://en.wikipedia.org/wiki/Karnaugh_map

If you really enjoy the reading, you can sign up for a digital logic design class if you have access to an engineering or computer science university. You will get to build logic circuits with large ICs on a breadboard, though nowadays most CPUs are "written" with code, like software, and "printed" onto a silicon wafer.

Of particular interest are NAND and NOR, due to their functional completeness (you can use NAND or NOR alone to construct any truth table).

NAND (logical symbol looks like =Do-)

A
  =Do- Q    is Q = NOT(A AND B)
B

Truth table
A    B     Q
0    0     1
0    1     1
1    0     1
1    1     0

You can rewrite any logic with NAND.
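
Here is a rough sketch of that completeness in C, with the ~ and & operators standing in for the hardware gate (that substitution is just for illustration; the silicon is wired from transistors, not built from C operators, and the function names are mine):

#include <stdio.h>

/* NAND applied bitwise: the only primitive we allow ourselves. */
static unsigned nand(unsigned a, unsigned b) { return ~(a & b); }

static unsigned not_(unsigned a)             { return nand(a, a); }                   /* NOT(a) = NAND(a, a)      */
static unsigned and_(unsigned a, unsigned b) { return nand(nand(a, b), nand(a, b)); } /* AND = NOT(NAND)          */
static unsigned or_(unsigned a, unsigned b)  { return nand(nand(a, a), nand(b, b)); } /* OR via De Morgan's law   */

int main(void) {
    printf("%d\n", and_(12, 7) == (12u & 7u)); /* prints 1 */
    printf("%d\n", or_(12, 7)  == (12u | 7u)); /* prints 1 */
    printf("%d\n", not_(7)     == ~7u);        /* prints 1 */
    return 0;
}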

As you can also see, it's pretty efficient: you can't get any lower level than a single gate with binary (though there is ternary / tri-state logic), so it's a single clock state change. So for a 64-bit CPU register, you'll need 64 of these babies side by side, PER register... PER core... PER instruction. And that is only the "logical" registers: because advanced processors (like Intel Core) do register renaming, you have more physical registers in silicon than are logically available to you by name.

Upvotes: 11

Kaganar

Reputation: 6570

As you know, data in computers is represented in a binary format.

For example, if you have the integer 13 it's represented as 1101b (where b means binary). This works out to (1) * 8 + (1) * 4 + (0) * 2 + (1) * 1 = 13, just like (1) * 10 + (3) * 1 = 13 -- different bases.
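
A quick way to see those digits from C is to pull the bits out one at a time (a sketch; the loop bound of 4 is just enough bits for the value 13):

#include <stdio.h>

int main(void) {
    int n = 13;
    for (int i = 3; i >= 0; i--)   /* most significant bit first */
        printf("%d", (n >> i) & 1);
    printf("\n");                  /* prints: 1101 */
    return 0;
}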

However, for basic operations computers need to know how much data you're working with. A typical integer size is 32 bits. So it's not just 1101b, it's 00000000000000000000000000001101b -- 32 bits, most of them unused.

Bitwise operations are just that -- they operate only on a bit level. Adding, multiplying, and other operations consider multiple bits at a time to perform their function, but bitwise operators do not. For example:

What's 10 bitwise-and 7? (in C vernacular, 10 & 7)

1010b 10 &
0111b  7
-----  =
0010b  2

Why? Think vertically! Look at the left set of digits -- 1 and 0 is 0. Then, 0 and 1 is 0. Then, 1 and 1 is 1. Finally, 0 and 1 is 0.

This is based on the and truth table that states these rules -- that only true (aka 1) and true (aka 1) results in true (aka 1). All other combinations result in false (aka 0).

Likewise, the or truth table states that all results are true (aka 1) except for false (aka 0) and false (aka 0) which results in false (aka 0).

Let's do the same example, but this time let's compute 10 bitwise-or 7. (Or in C vernacular, 10 | 7)

1010b 10 |
0111b  7
-----  =
1111b 15

And finally, let's consider one other principal bitwise operator: not. This is a unary operator where you simply flip each bit. Let's compute bitwise-not 7 (or in C vernacular, ~7)

0111b ~7
-----  =
1000b  8

But wait... What about all those leading zeroes? Well, yes, before I was omitting them because they weren't important, but now they surely are:

00000000000000000000000000000111b ~7
---------------------------------  =
11111111111111111111111111111000b  ... big number?

If you're instructing the computer to treat the result as an unsigned integer (32-bit), that's a really big number: 4,294,967,288, just under 2^32. If you're instructing the computer to treat the result as a signed integer (32-bit), that's -8.
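
All of the examples above in C, including both readings of ~7 (a sketch; the printed unsigned value assumes 32-bit ints, which is typical on mainstream platforms but not guaranteed by the standard):

#include <stdio.h>

int main(void) {
    printf("%d\n", 10 & 7);  /* 2          -- bitwise-and            */
    printf("%d\n", 10 | 7);  /* 15         -- bitwise-or             */
    printf("%u\n", ~7u);     /* 4294967288 -- unsigned view of not 7 */
    printf("%d\n", ~7);      /* -8         -- signed view of not 7   */
    return 0;
}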

As you may have guessed, since the logic is really quite simple for all these operations, there's not much you can do to make them individually faster. However, bitwise operations obey the same logic as boolean logic, and thus you can use boolean logic reduction techniques to reduce the number of bitwise operations you may need.

e.g. (A & B) | (A & C) results in the same as A & (B | C)

However, that's a much larger topic. Karnaugh maps are one technique, but boolean algebra is usually what I end up using while programming.
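
If you want to convince yourself that a reduction like the one above is safe, a brute-force check in C is enough (a sketch; because bitwise operators treat each bit independently, checking all 4-bit values already covers every case that larger widths would):

#include <stdio.h>

int main(void) {
    /* Verify (A & B) | (A & C) == A & (B | C) over all 4-bit values. */
    for (unsigned a = 0; a < 16; a++)
        for (unsigned b = 0; b < 16; b++)
            for (unsigned c = 0; c < 16; c++)
                if (((a & b) | (a & c)) != (a & (b | c)))
                    printf("counterexample: %u %u %u\n", a, b, c);
    printf("done\n");  /* no counterexamples are printed */
    return 0;
}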

Upvotes: 0

Lee Daniel Crocker

Reputation: 13196

AND, OR, XOR, and NOT operations are implemented quite efficiently in silicon, and so are generally a single-cycle native instruction on most processors. That is, for a 16-bit processor, whole 16-bit registers are ANDed at once; on a 32-bit processor, 32 bits at once, etc. The only performance issue you might want to be aware of is alignment: on an ARM processor, for example, if a 32-bit value starts at a memory address that is a multiple of 4, then a read-modify-write can be done in two or three cycles. If it's at an odd address, it has to do two reads at the neighboring aligned addresses and two writes, and so is slower.
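
If you need to deal with this from C, the portable approach is to check alignment, or to sidestep the issue with memcpy, rather than hope the hardware tolerates a misaligned pointer (a sketch; the buffer contents and the little-endian comment are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char buf[8] = {0, 13, 0, 0, 0, 0, 0, 0};
    const unsigned char *p = buf + 1;  /* offset by one byte, so typically misaligned */

    printf("4-byte aligned? %s\n", ((uintptr_t)p % 4 == 0) ? "yes" : "no");

    uint32_t v;
    memcpy(&v, p, sizeof v);  /* safe even when p is not 4-byte aligned */
    printf("%u\n", v);        /* 13 on a little-endian machine */
    return 0;
}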

Bit shifting in some older processors may involve looping over single shifts. That is, 1 << 5 will take longer than 1 << 2. But most modern processors have what is called a "barrel shifter" that equalizes all shifts up to the register size, so on a Pentium, 1 << 31 takes no longer than 1 << 2.
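
The idea behind a barrel shifter is easy to sketch in C: do the shift in log2(width) fixed stages, one per bit of the shift amount, so every shift costs the same. In hardware the stages are layers of multiplexers, not instructions; this function is just an illustration, and its name is mine:

#include <stdio.h>

/* Left-shift x by s (0..31) in five fixed stages, one per bit of s. */
static unsigned barrel_shl(unsigned x, unsigned s) {
    if (s & 1)  x <<= 1;
    if (s & 2)  x <<= 2;
    if (s & 4)  x <<= 4;
    if (s & 8)  x <<= 8;
    if (s & 16) x <<= 16;
    return x;
}

int main(void) {
    printf("%u\n", barrel_shl(1, 5));   /* 32 */
    printf("%u\n", barrel_shl(1, 31));  /* 2147483648 -- same five stages as 1 << 2 */
    return 0;
}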

Addition and subtraction are fast primitives as well. Multiplication and division are tricky: these are mostly implemented as microcode loops. Multiplication can be sped up by unrolling the loops into huge swaths of silicon in a high-end processor, but division cannot, so generally division is the slowest basic operation in a microprocessor.
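
For a feel of why multiplication is loop-shaped to begin with, here is the classic shift-and-add scheme in C (a sketch of the principle only; real multipliers use refinements such as Booth recoding and adder trees, which is what unrolling into silicon buys you):

#include <stdio.h>

/* Add a shifted copy of a for every set bit of b: one add per bit. */
static unsigned mul(unsigned a, unsigned b) {
    unsigned result = 0;
    while (b) {
        if (b & 1)
            result += a;
        a <<= 1;
        b >>= 1;
    }
    return result;
}

int main(void) {
    printf("%u\n", mul(12, 7));   /* 84  */
    printf("%u\n", mul(10, 10));  /* 100 */
    return 0;
}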

Upvotes: 5

hcs

Reputation: 1534

Bitwise operations are what processors are made of, so it is natural to expose those operations with instructions. Operations like AND, OR, XOR, NOR, NAND and NOT can be performed by the ALU with only a few logic gates per bit. Importantly, each bit of the result only relies on two bits of the input (unlike multiplication or addition), so the entire operation can proceed in parallel without any complication.
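
That independence is easy to demonstrate in C: each result bit of a & b can be computed from just the matching pair of input bits, so all 32 "lanes" could run at once. The loop below serializes what the ALU does in parallel; it is an illustration of the property, not how the instruction actually executes:

#include <stdio.h>

/* a & b computed one bit at a time: result bit i depends only on
   bit i of a and bit i of b -- there is no carry between lanes. */
static unsigned and_per_bit(unsigned a, unsigned b) {
    unsigned result = 0;
    for (int i = 0; i < 32; i++)
        result |= (((a >> i) & 1) & ((b >> i) & 1)) << i;
    return result;
}

int main(void) {
    printf("%u\n", and_per_bit(12, 7));  /* 4 */
    printf("%u\n", and_per_bit(10, 7));  /* 2 */
    return 0;
}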

Upvotes: 3
