Reputation: 85
I need to hold a very large vector in memory, about 10**8 elements, and I need fast random access to it. I tried to use numpy.memmap, but encountered the following error:
    RuntimeWarning: overflow encountered in int_scalars
      bytes = long(offset + size*_dbytes)
    fid.seek(bytes - 1, 0): [Errno 22] Invalid argument
It seems that memmap uses a long internally and my vector length is too big for it.
Is there a way to overcome this and still use memmap, or is there a good alternative?
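A minimal sketch of the kind of call that hits this (the dtype and filename here are just placeholders, not my exact code):

    import numpy as np

    # Placeholder dtype/filename; the point is just a ~10**8-element memmap.
    # The overflow happens in memmap's byte computation,
    # bytes = long(offset + size*_dbytes), before the failing seek.
    vec = np.memmap('vector.dat', dtype='float64', mode='w+', shape=(10**8,))
    print(vec[12345])  # the fast random access I'm after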
Thanks
Upvotes: 2
Views: 899
Reputation: 13999
It sounds like you're using a 32-bit version of Python (I also assume you're running on Windows). From the numpy.memmap docs:
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
So the simple solution to your problem is to just upgrade your Python install to 64-bit.
If your CPU was manufactured sometime in the last decade, it should be possible to upgrade to 64-bit Python.
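If you're not sure which build you're currently running, the interpreter can tell you directly; this quick check uses only the standard library:

    import struct
    import sys

    # Pointer size of the running interpreter: prints 32 or 64.
    print(struct.calcsize("P") * 8)

    # Another giveaway: sys.maxsize is 2**31 - 1 on 32-bit builds
    # and 2**63 - 1 on 64-bit builds.
    print(sys.maxsize)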
So long as your Python is 32-bit, working with arrays larger than 2 GB is never going to be easy or straightforward. Your only real option is to split the array up into pieces no larger than 2 GB at the time you originally create it/write it out to disk. You would then operate on each piece independently.
Also, you'd still have to use numpy.memmap with each piece, since Python itself will run out of memory otherwise.
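Here's a minimal sketch of that chunked approach; the chunk size, dtype and file naming are arbitrary choices for illustration, not something numpy prescribes:

    import numpy as np

    N = 10**8              # total number of elements
    DTYPE = np.float64     # assumed dtype
    CHUNK = 50 * 10**6     # elements per piece, comfortably under the 2 GB limit

    # Write the vector out as several independent memmap files.
    for i in range(0, N, CHUNK):
        n = min(CHUNK, N - i)
        piece = np.memmap('vec_%d.dat' % (i // CHUNK), dtype=DTYPE,
                          mode='w+', shape=(n,))
        piece[:] = 0.0     # fill this slice with whatever data belongs here
        piece.flush()

    def read_element(idx):
        """Random access: open only the piece that holds element idx."""
        chunk_id, offset = divmod(idx, CHUNK)
        n = min(CHUNK, N - chunk_id * CHUNK)
        piece = np.memmap('vec_%d.dat' % chunk_id, dtype=DTYPE,
                          mode='r', shape=(n,))
        return piece[offset]

    print(read_element(99999999))

In practice you'd open each piece once and keep the memmap objects around instead of re-opening one per lookup; the sketch just shows the indexing arithmetic.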
If handling these kinds of large arrays is something you have to do regularly, you should consider switching your code/workflow over to one of the big data frameworks. There's a whole bunch of them available for Python now. I've used PySpark extensively before, and it's pretty easy to use (though it requires a bunch of setup). In the comments B. M. mentions Dask, another such big data framework.
Though if this is just a one-off task, it's probably not worth the trouble to spin up one of these frameworks.
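If you do go the Dask route specifically, a rough sketch might look like this (the chunk size is my choice; the Dask docs have their own guidance on picking chunk sizes):

    import dask.array as da

    # A 10**8-element array split into 10**7-element chunks; each chunk
    # is only materialised as a numpy array when it's actually needed.
    x = da.zeros(10**8, dtype='float64', chunks=10**7)

    # Indexing builds a lazy task graph; .compute() pulls the value out.
    value = x[12345].compute()
    print(value)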
Upvotes: 2