Reputation: 609
Python sizes its integers automatically based on the underlying system architecture. Unfortunately, I have a huge dataset that needs to be loaded fully into memory.
So, is there a way to force Python to use only 2 bytes for some integers (the equivalent of C++'s 'short')?
Upvotes: 32
Views: 70598
Reputation: 1
You can pack a bunch of smaller ints into a single larger int, then access specific bits of it with masks and shifts:
n = 4532 # '0b1000110110100'
mask = 0b000011110000 # We want to access the middle 4 bits
mid = (n & mask) >> 4 # Leave only the masked bits and shift them back to position 0
For putting data in, first use the inverse of the mask to blank out the part of the int you want to replace, then shift the new data into position and 'or' them together:
n = (n & ~mask) | (yourvalue << 4)
The downside is that you have to keep track of where each piece of data lives inside the int yourself; you are effectively managing that memory by hand.
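A minimal consolidated sketch of the same idea, wrapping the mask arithmetic in a pair of helpers (the offset and width values here are arbitrary illustrations, not anything fixed by the technique):
def get_field(n, offset, width):
    # Extract 'width' bits starting at bit 'offset' (bit 0 = least significant).
    return (n >> offset) & ((1 << width) - 1)
def set_field(n, offset, width, value):
    # Blank the field with the inverted mask, then 'or' in the new value.
    mask = ((1 << width) - 1) << offset
    return (n & ~mask) | ((value & ((1 << width) - 1)) << offset)
n = set_field(4532, 4, 4, 0b1010)
print(get_field(n, 4, 4)) # 10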
Upvotes: 0
Reputation: 21
Using a bytearray, which is basically a C unsigned char array under the hood, will be a better solution than using large integers. There is little overhead for manipulating a bytearray, and it has much less storage overhead than large integers: it's possible to get a storage density of 7.99+ bits per byte.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
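To store 2-byte values rather than single bytes, you can pair a bytearray with struct.pack_into/unpack_from. A minimal sketch, assuming little-endian unsigned shorts (the count and slot index here are made up for illustration):
import struct
count = 1000
buf = bytearray(2 * count) # two bytes per stored value
struct.pack_into('<H', buf, 2 * 42, 54321) # write 54321 into slot 42
value, = struct.unpack_from('<H', buf, 2 * 42) # read slot 42 back
print(value) # 54321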
Upvotes: 1
Reputation: 21
You can also store multiple integers of any size in a single large integer.
For example, as seen below, in Python 3 on a 64-bit x86 system a 1024-bit integer takes 164 bytes of memory storage. That means on average one byte stores around 6.24 bits. And with even larger integers you get an even higher bit-storage density, for example around 7.50 bits per byte with an integer 2**20 bits wide.
Obviously you will need some wrapper logic to access the individual short numbers stored in the larger integer, which is easy to implement (a sketch follows the numbers below).
One issue with this approach is that data access slows down due to the use of large-integer operations.
If you access a big batch of consecutively stored integers at once, so as to minimize the number of operations on the large integer, the slower access won't be much of an issue.
I guess using numpy would be the easier approach, though.
>>> import sys
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
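A minimal sketch of the wrapper logic mentioned above (the 16-bit field width is an assumption; any fixed width works the same way):
WIDTH = 16
MASK = (1 << WIDTH) - 1
def get_short(big, i):
    # Read the i-th 16-bit field out of the large integer.
    return (big >> (i * WIDTH)) & MASK
def set_short(big, i, value):
    # Blank the i-th field with the inverted mask, then 'or' in the new value.
    shift = i * WIDTH
    return (big & ~(MASK << shift)) | ((value & MASK) << shift)
big = 0
big = set_short(big, 3, 54321)
print(get_short(big, 3)) # 54321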
Upvotes: 1
Reputation: 609
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module, which packs C-style structs into a bytes string:
From the documentation (https://docs.python.org/library/struct.html):
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
b'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
(The exact byte string and sizes depend on your platform's byte order and alignment; the documentation's example was produced on a big-endian machine.)
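For the original use case, you can pack a whole list of shorts in one call; a minimal sketch (the values here are made up):
from struct import pack, unpack
values = [1, 2, 3, 4, 5]
fmt = '%dh' % len(values) # e.g. '5h' means five signed shorts
packed = pack(fmt, *values) # 10 bytes instead of five Python int objects
print(unpack(fmt, packed)) # (1, 2, 3, 4, 5)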
Upvotes: 5
Reputation: 10157
Armin's suggestion of the array module is probably best. Two possible alternatives:
It's also worth realising that a Python integer object is not 4 bytes - there is considerable additional object overhead (sys.getsizeof(1) reports 28 bytes on a typical 64-bit CPython 3). So if you have a really large number of shorts, you can save more than two bytes per number by using C shorts in some way (e.g. the array module).
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure, IIRC). I switched to an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree for numbers larger than 32 bits.)
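For reference, a minimal IIBTree usage sketch (assuming the BTrees package that ships with ZODB is installed; keys and values must fit in a signed 32-bit int):
from BTrees.IIBTree import IIBTree
tree = IIBTree() # dict-like mapping with keys and values stored as C ints
tree[42] = 7
print(tree[42]) # 7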
Upvotes: 3
Reputation: 32563
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer; it is only converted back to a regular Python int when you read it out.
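A quick usage sketch (the sizes are illustrative and vary by platform):
from array import array
import sys
a = array('h', range(1000)) # 1000 signed shorts
print(a.itemsize) # 2 bytes per element
print(sys.getsizeof(a)) # roughly 2000 bytes plus a small fixed overhead
print(a[5] + 1) # elements come back out as ordinary ints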
Upvotes: 45