Arnav

Reputation: 609

Short Integers in Python

Python allocates integers automatically based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be fully loaded into memory.

So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?

Upvotes: 32

Views: 70598

Answers (7)

Alezarin

Reputation: 1

You can treat one int as a group of smaller ints packed together, then access specific bits from it:

n = 4532  # 0b1000110110100
mask = 0b000011110000  # We want to access 4 bits in the middle (bits 4-7)
mid = (n & mask) >> 4  # Keep only the masked bits, then shift them down to bit 0

To put data in, first clear that part of the int with the inverted mask, then bitshift the new data into position and 'or' the two together:

n = (n & ~mask) | (yourvalue << 4)

The downside is that you have to keep track of where each piece of data sits inside the int yourself; you are effectively managing memory by hand.
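
The read and write steps above can be wrapped in two small helpers. This is only a sketch: the names get_field and set_field are made up for illustration, and it assumes non-negative values.

```python
def get_field(n, shift, width):
    """Return the width-bit field of n that starts at bit `shift`."""
    mask = (1 << width) - 1
    return (n >> shift) & mask


def set_field(n, shift, width, value):
    """Return n with the width-bit field at bit `shift` replaced by value."""
    mask = (1 << width) - 1
    if not 0 <= value <= mask:
        raise ValueError("value does not fit in the field")
    # Clear the field with the inverted mask, then OR the new bits into place.
    return (n & ~(mask << shift)) | (value << shift)


n = 4532                           # 0b1000110110100
mid = get_field(n, 4, 4)           # bits 4-7 of n
n2 = set_field(n, 4, 4, 0b0101)    # same number with bits 4-7 replaced
```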

Upvotes: 0

silvester_J

Reputation: 51

You can use NumPy's fixed-width integer types, such as np.int8 or np.int16.

Upvotes: 5

user12658139

Reputation: 21

Using a bytearray in Python, which is basically a C unsigned char array under the hood, is a better solution than using large integers. Manipulating a byte array adds no extra overhead, and it has much less storage overhead than large integers. It's possible to get a storage density of 7.99+ bits per byte with bytearrays.

>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
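
To read and write 2-byte values in such a buffer, one option is a memoryview cast. This is a rough sketch, assuming the platform's native short ('h') is 2 bytes, which is the usual case; the variable names are only for illustration.

```python
import struct

count = 1_000_000
short_size = struct.calcsize("h")        # native size of a C short, usually 2

buf = bytearray(short_size * count)      # zero-filled storage
shorts = memoryview(buf).cast("h")       # view the same bytes as signed shorts

shorts[0] = -12345                       # writes go straight into buf
shorts[count - 1] = 32767
```

Values outside the range of a C short are rejected on assignment, so the view cannot silently overflow.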

Upvotes: 1

user12658139

Reputation: 21

You can also store several integers of any size in a single large integer.

For example, as shown below, in Python 3 on a 64-bit x86 system a 1024-bit integer takes 164 bytes of memory, so on average each byte stores about 6.24 bits. Larger integers give a higher storage density: about 7.50 bits per byte for an integer 2**20 bits wide.

Obviously you will need some wrapper logic to access individual short numbers stored in the larger integer, which is easy to implement.

One issue with this approach is that data access slows down due to the use of large-integer operations.

If you read a big batch of consecutively stored integers at once, so that the large integer is touched only rarely, the slower access won't be an issue.

Using NumPy will probably be the easier approach.

>>> import sys
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905

>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
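
The wrapper logic and the batch access mentioned above might look roughly like this. It is a sketch under assumptions: the values are unsigned 16-bit, slot 0 lives in the lowest bits, and the function names pack_shorts and unpack_batch are invented for the example.

```python
import struct

BITS = 16  # width of each stored value


def pack_shorts(values):
    """Pack unsigned 16-bit values into one big int, slot 0 in the low bits."""
    raw = struct.pack(f"<{len(values)}H", *values)
    return int.from_bytes(raw, "little")


def unpack_batch(packed, start, count):
    """Read `count` consecutive slots beginning at slot `start`."""
    # One shift and one mask on the big int, then cheap byte-level decoding.
    chunk = (packed >> (BITS * start)) & ((1 << (BITS * count)) - 1)
    raw = chunk.to_bytes(2 * count, "little")
    return list(struct.unpack(f"<{count}H", raw))


packed = pack_shorts([7, 65535, 0, 1234])
middle = unpack_batch(packed, 1, 2)
everything = unpack_batch(packed, 0, 4)
```

Reading a batch costs a single shift and mask on the large integer, which is the access pattern that keeps the slowdown manageable.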

Upvotes: 1

Arnav

Reputation: 609

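Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module, which packs C-style structs into a bytes object (a str in Python 2):

Adapted from the documentation (https://docs.python.org/library/struct.html) for Python 3; the '>' prefix selects big-endian byte order with standard sizes, so the output is the same on every platform:

>>> from struct import pack, unpack, calcsize
>>> pack('>hhl', 1, 2, 3)
b'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('>hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('>hhl')
8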
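
For a whole sequence of shorts, a repeat count in the format string packs them at two bytes each. A minimal sketch, using the '<' prefix (little-endian, standard sizes) as an arbitrary choice:

```python
import struct

values = [1, -2, 300, -32768, 32767]

fmt = f"<{len(values)}h"                 # five signed 16-bit integers
blob = struct.pack(fmt, *values)         # 2 bytes per value
restored = list(struct.unpack(fmt, blob))

# iter_unpack walks a buffer one record at a time.
first_two = [v for (v,) in struct.iter_unpack("<h", blob[:4])]
```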

Upvotes: 5

Tony Meyer

Reputation: 10157

Armin's suggestion of the array module is probably best. Two possible alternatives:

  • You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
  • You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure. Ugly, but it can be made to work.

It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
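
A rough way to see that overhead is to compare sys.getsizeof totals. This sketch assumes CPython; exact numbers vary by version and platform, and the list total slightly over-counts because CPython shares small int objects.

```python
import sys
from array import array

values = list(range(-1000, 1000))

# List of Python ints: the list's pointer table plus every int object.
as_list = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array of C shorts: one object whose buffer holds the raw values.
as_array = sys.getsizeof(array("h", values))

print(as_list, as_array)
```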

I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when the number was larger than 32 bits).

Upvotes: 3

Armin Ronacher

Reputation: 32563

Nope. But you can use short integers in arrays:

from array import array
a = array("h") # h = signed short, H = unsigned short

As long as the value stays in that array it will be a short integer.
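
A short usage sketch, assuming a 16-bit C short (the usual case): items are stored as 2-byte values, out-of-range values are rejected, and anything read back out is an ordinary Python int again.

```python
from array import array

a = array("h", [1, 2, 3])   # stored as C shorts
a.append(-32768)            # smallest value a 16-bit short can hold

try:
    a.append(40000)         # too large for a signed 16-bit short
    overflowed = False
except OverflowError:
    overflowed = True

item = a[0]                 # a regular Python int once it leaves the array
```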

Upvotes: 45
