Reputation: 363
>>> x = -4
>>> print("{} {:b}".format(x, x))
-4 -100
>>> mask = 0xFFFFFFFF
>>> print("{} {:b}".format(x & mask, x & mask))
4294967292 11111111111111111111111111111100
>>>
>>> x = 0b11111111111111111111111111111100
>>> print("{} {:b}".format(x, x))
4294967292 11111111111111111111111111111100
>>> print("{} {:b}".format(~(x ^ mask), ~(x ^ mask)))
-4 -100
I am having trouble figuring out how Python represents negative integers, and therefore how bit operations work. It is my understanding that Python attempts to emulate two's complement, but with any number of bits. Therefore, it is common to use 32-bit masks to force Python to set a standard size on integers before bit operations.
As you can see in my example, -4 & 0xFFFFFFFF
yields a large positive number. Why does Python seem to read this as an unsigned integer, instead of a two's complement negative number? Later, the operation ~(x ^ mask)
, which should yield the exact same two's complement bit pattern as the large positive, instead gives -4
. What causes the conversion to a signed int?
Thanks!
Upvotes: 24
Views: 15860
Reputation: 32904
Python doesn't specify a internal number representation for signed integers (like two's complement, sign-magnitude, etc.). Different implementations of Python like CPython, IronPython, PyPy, etc. can do different things. Hence going with the manual is a better bet as it's a language-level guarantee than depending on one particular implementation's behaviour.
The answer to both your questions is quite simple when you read the manual § Bitwise Operations on Integer Types (emphasis mine):
The result of bitwise operations is calculated as though carried out in two’s complement with an infinite number of sign bits.
Two important rules of converting two's complement to decimal before we proceed:
4 = 0b0..0_0100 # Notice infinite 0s preceding as 4 is +ve
~4 = 0b1..1_1011 # Notice infinite 1s preceding (MSB set)
We can't work with infinity easily, let's assume a 4-bit representation. So ~4
is 0b1011
; this is -5
in two's complement. Why? Recall rules of working with two's complement above: negative as MSB is set, magnitude = two's complement(0b1011
) = 100 + 1 = 101 = 5
. -5
stays the same even with a wider type e.g. as 8-bit 0b1111_1011
two's complement is -5
. This works at any widths wider than 4's binary representation.
-4 & 0xFFFFFFFF
yields a large positive number. Why does Python seem to read this as an unsigned integer, instead of a two's complement negative number?
Python does represent -4
as two's complement but with infinite sign bit extension. -4
in two's complement 8-bit representation: 0b1111_1100
. Let's do the bitwise AND
remembering our infinite sign bit extension:
0b1..1_1111_1111_1111_1111_1111_1111_1111_1100 # -4 two's complement in ∞-bit repr
0b0..0_1111_1111_1111_1111_1111_1111_1111_1111 # 0xffff_ffff in ∞-bit repr
#---------------------------------------------- &
0b0..0_1111_1111_1111_1111_1111_1111_1111_1100
#----------------------------------------------
The result of -4 & 0xffff_ffff
is the infinite sign bits preceding are cut off by the mask: you get 4294967292
.
the operation
~(x ^ mask)
, which should yield the exact same two's complement bit pattern as the large positive, instead gives-4
. What causes the conversion to a signed int?
x ^ mask
= 0b1..100 ^ 0b1..111
= 3
. ~3 = -4
due to the same rules I've explained above.
In Python bitwise complement is going to flip the sign as the MSB too gets flipped. Bitwise AND with a mask to always get a positive value; to force it into a positive value irrespective of the operand's sign.
Upvotes: 0
Reputation: 8180
TLDR; CPython integer type stores the sign in a specific field of a structure. When performing a bitwise operation, CPython replaces negative numbers by their two's complement and sometimes (!) performs the reverse operation (ie replace the two's complements by negative numbers).
The internal representation of an integer is a PyLongObject
struct, that contains a PyVarObject
struct. (When CPython creates a new PyLong
object, it allocates the memory for the structure and a trailing space for the digits.) What matter here is that the PyLong
is sized: the ob_size
field of the PyVarObject
embedded struct contains the size (in digits) of the integer (digits are either 15 or 30 bits digits).
If the integer is negative then this size is minus the number of digits .
(References: https://github.com/python/cpython/blob/master/Include/object.h and https://github.com/python/cpython/blob/master/Include/longobject.h)
As you see, the inner CPython's representation of an integer is really far from the usual binary representation. Yet CPython has to provide bitwise operations for various purposes. Let's take a look at the comments in the code:
static PyObject *
long_bitwise(PyLongObject *a,
char op, /* '&', '|', '^' */
PyLongObject *b)
{
/* Bitwise operations for negative numbers operate as though
on a two's complement representation. So convert arguments
from sign-magnitude to two's complement, and convert the
result back to sign-magnitude at the end. */
/* If a is negative, replace it by its two's complement. */
/* Same for b. */
/* Complement result if negative. */
}
To handle negative integers in bitwise operations, CPython use the two's complement (actually, that's a two's complement digit by digit, but I don't go into the details). But note the "Sign Rule" (name is mine): the sign of the result is the bitwise operator applied to the signs of the numbers. More precisely, the result is negative if nega <op> negb == 1
, (negx
= 1
for negative, 0
for positive). Simplified code:
switch (op) {
case '^': negz = nega ^ negb; break;
case '&': negz = nega & negb; break;
case '|': negz = nega | negb; break;
default: ...
}
On the other hand, the formatter does not perform the two's complement, even in binary representation: format_long_internal calls long_format_binary and remove the two leading characters, but keeps the sign. See the code:
/* Is a sign character present in the output? If so, remember it
and skip it */
if (PyUnicode_READ_CHAR(tmp, inumeric_chars) == '-') {
sign_char = '-';
++prefix;
++leading_chars_to_skip;
}
The long_format_binary
function does not perform any two's complement: just output the number in base 2, preceded by the sign.
if (negative) \
*--p = '-'; \
I will follow your REPL sequence:
>>> x = -4
>>> print("{} {:b}".format(x, x))
-4 -100
Nothing surprising, given that there is no two's complement in formatting, but a sign.
>>> mask = 0xFFFFFFFF
>>> print("{} {:b}".format(x & mask, x & mask))
4294967292 11111111111111111111111111111100
The number -4
is negative. Hence, it is replaced by its two's complement before the logical and, digit by digit. You expected that the result will be turned into a negative number, but remember the "Sign Rule":
>>> nega=1; negb=0
>>> nega & negb
0
Hence: 1. the result does not have the negative sign; 2. the result is not complemented to two. Your result is compliant with the "Sign Rule", even if this rule doesn't seem very intuitive.
Now, the last part:
>>> x = 0b11111111111111111111111111111100
>>> print("{} {:b}".format(x, x))
4294967292 11111111111111111111111111111100
>>> print("{} {:b}".format(~(x ^ mask), ~(x ^ mask)))
-4 -100
Again, -4
is negative, hence replaced by it's two's complement 0b11111111111111111111111111111100
, then XORed with 0b11111111111111111111111111111111
. The result is 0b11
(3
). You take the complement unary, that is 0b11111111111111111111111111111100
again, but this time the sign is negative:
>>> nega=1; negb=0
>>> nega ^ negb
1
Therefore, the result is complemented and gets the negative sign, as you expected.
Conclusion: I guess there was no perfect solution to have arbitrary long signed number and provide bitwise operations, but the documentation is not really verbose on the choices that were made.
Upvotes: 18