Reputation: 1407
I want to put many large long integers into memory without any space between them. How to do that with python 2.7 code in linux?
The large long integers all use the same number of bits. There is totally about 4 gb of data. Leaving spaces of a few bits to make each long integer uses multiples of 8 bits in memory is ok. I want to do bitwise operation on them later.
So far, I am using a python list. But I am not sure if that leaves no space in memory between the integers. Can ctypes help?
Thank you.
The old code uses bitarray (https://pypi.python.org/pypi/bitarray/0.8.1)
import bitarray
data = bitarray.bitarray()
with open('data.bin', 'rb') as f:
data.fromfile(f)
result = data[:750000] & data[750000:750000*2]
This works and the bitarray doesn't have gap in memory. But, the bitarray bitwise and is slower than the native python's bitwise operation of long integer by about 6 times on the computer. Slicing the bitarray in the old code and accessing an element in the list in the newer code use roughly the same amount of time.
Newer code:
import cPickle as pickle
with open('data.pickle', 'rb') as f:
data = pickle.load(f)
# data is a list of python's (long) integers
result = data[0] & data[1]
Numpy: In the above code. result = data[0] & data[1] creates a new long integer. Numpy has the out option for numpy.bitwise_and. That would avoid creating a new numpy array. Yet, numpy's bool array seems to use one byte per bool instead of one bit per bool. While, converting the bool array into a numpy.uint8 array avoids this problem, counting the number of set bit is too slow.
The python's native array can't handle the large long integers:
import array
xstr = ''
for i in xrange(750000):
xstr += '1'
x = int(xstr, 2)
ar = array.array('l',[x,x,x])
# OverflowError: Python int too large to convert to C long
Upvotes: 3
Views: 252
Reputation: 11424
The primary cause for space inefficiency is the internal structure of a Python long. Assuming a 64-bit platform, Python only uses 30 out of 32 bits to store a value. The gmpy2 library provides access to the GMP (GNU Multiple Precision Arithmetic Library). The internal structure of the gmpy2.mpz
type uses all available bits. Here is the size difference for storing a 750000-bit value.
>>> import gmpy2
>>> import sys
>>> a=long('1'*750000, 2)
>>> sys.getsizeof(a)
100024
>>> sys.getsizeof(gmpy2.mpz(a))
93792
The &
operation with ``gmpy2.mpz` is also significantly faster.
$ python -m timeit -s "a=long('A'*93750,16);b=long('7'*93750)" "c=a & b"
100000 loops, best of 3: 7.78 usec per loop
$ python -m timeit -s "import gmpy2;a=gmpy2.mpz('A'*93750,16);b=gmpy2.mpz('7'*93750)" "c=a & b"
100000 loops, best of 3: 4.44 usec per loop
If all your operations are in-place, the gmpy2.xmpz
type allows changing the internal value of an instance without creating a new instance. It is faster as long as all the operations are immediate.
$ python -m timeit -s "import gmpy2;a=gmpy2.xmpz('A'*93750,16);b=gmpy2.xmpz('7'*93750)" "a &= b"
100000 loops, best of 3: 3.31 usec per loop
Disclaimer: I maintain the gmpy2
library.
Upvotes: 2