Reputation: 3778
I would like to load as much data as is safe, so that the current process works fine and so do other processes. I would prefer to use RAM only (no swap), but any suggestions are welcome. Excessive data can be discarded. What is the proper way of doing this? If I just wait for a MemoryError, the system becomes inoperable (when using a list).
data_storage = []
for data in read_next_data():
    data_storage.append(data)
The data is finally to be loaded into a numpy array.
Upvotes: 3
Views: 2319
Reputation: 152840
psutil has a virtual_memory function that, among other things, contains an attribute representing the free memory:
>>> psutil.virtual_memory()
svmem(total=4170924032, available=1743937536, percent=58.2, used=2426986496, free=1743937536)
>>> psutil.virtual_memory().free
1743937536
That should be pretty accurate (but the function call is costly, i.e. slow, at least on Windows). A MemoryError doesn't take memory used by other processes into account; it is only raised if the memory of the array exceeds the total available (free or not) RAM.
You may have to guess the point at which you stop accumulating, because the free memory can change (other processes need some additional memory from time to time), and the conversion to numpy.array might temporarily double your used memory, because at that moment the list and the array must both fit into your RAM.
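For illustration, a minimal sketch of such a stopping rule; the 25% headroom threshold is an arbitrary assumption, and read_next_data is the generator from the question:

import psutil

# arbitrary assumption: keep 25% of the initially free memory as headroom
headroom = psutil.virtual_memory().free * 0.25
data_storage = []
for data in read_next_data():
    data_storage.append(data)
    if psutil.virtual_memory().free < headroom:
        break  # stop accumulating; excess data is discarded

Since the virtual_memory call itself is slow, in practice you would check it only every few hundred iterations rather than on every append.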
However, you can also approach this in a different way:
1. Read the first dataset: read_next_data().
2. Determine the free memory: psutil.virtual_memory().free.
3. Use the shape of the first dataset and the dtype to calculate the shape of an array that fits easily into the RAM. Let's say it uses a factor (e.g. 75%) of the available free memory: rows = freeMemory * factor / (firstDataShape * memoryPerElement). That should give you the number of datasets that you read in at once.
4. Create the array: arr = np.empty((rows, *firstShape), dtype=firstDtype).
5. Read the datasets into it one by one: arr[i] = next(it), where it = read_next_data() is the iterator. That way you don't keep the lists around and you avoid the doubled memory. A sketch of these steps follows the list.
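A minimal sketch of these steps, assuming read_next_data yields equally shaped datasets and picking factor = 0.75 purely for illustration:

import numpy as np
import psutil

it = read_next_data()         # iterator from the question
first = np.asarray(next(it))  # the first dataset fixes shape and dtype
factor = 0.75                 # assumed fraction of free memory to use
# first.nbytes corresponds to firstDataShape * memoryPerElement above
rows = int(psutil.virtual_memory().free * factor / first.nbytes)

arr = np.empty((rows, *first.shape), dtype=first.dtype)
arr[0] = first
n = 1
for data in it:
    if n >= rows:
        break                 # array is full; excess data is discarded
    arr[n] = data
    n += 1
arr = arr[:n]                 # trim in case the source ran out early

Filling the preallocated array this way keeps the peak memory at roughly the size of arr itself, instead of the list plus the array.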
Upvotes: 5