Reputation: 73
Not I have 50GB dataset saved as h5py, which is a dictionary inside. The dictionary contains keys from 0 to n, and the values are numpy ndarray(3 dimension) which have the same shape. For example:
dictionary[0] = np.array([[[...],[...]]...])
I want to concat all these np arrays, code like
sample = np.concatenate(list(dictionary.values))
this operation waste 100GB memory! If I use
del dictionary
It will decrease to 50GB memory. But I want to control the memory usage as 50GB during loading data. Another way I tried like this
sample = np.concatenate(sample,dictionary[key])
It is still using 100GB memory. I think all the cases above, the right side will create a new memory block to save, and then assigned to the left side, which will double the memory during calculations. Thus, the third way I tried like this
sample = np.empty(shape)
with h5py.File(...) as dictionary:
for key in dictionary.keys():
sample[key] = dictionary[key]
I think this code has an advantage. The value dictionary[key] assigned to some row of sample, then the memory of dictionary[key] will clear. However, I test it and find that the memory usage is also 100GB. Why?
Are there any good methods to limit the memory usage as 50GB?
Upvotes: 6
Views: 3898
Reputation: 231475
dictionary[key]
is a dataset on the file. dictionary[key][...]
will be an numpy array, that dataset downloaded.
I imagine
sample[key] = dictionary[key]
is evaluated as
sample[key,...] = dictionary[key][...]
The dataset is downloaded, and then copied to a slice of the sample
array. That downloaded array should be free for recycling. But whether numpy/python does that is another matter. I'm not in the habit of pushing memory limits.
You don't want to do the incremental concatenate - that's slow. A single concatenate on the list should be faster. I don't know for such what
list(dictionary.values)
contains. Will it be references to the datasets, or downloaded arrays? Regardless concatenate(...)
on that list will have to used the downloaded arrays.
One thing puzzles me - how can you use the same key
to index the first dimension of sample
and dataset in dictionary
? h5py
keys are supposed to be strings, not integers.
Some testing
Note that I'm using string dataset names:
In [21]: d = f.create_dataset('0',data=np.zeros((2,3)))
In [22]: d = f.create_dataset('1',data=np.zeros((2,3)))
In [23]: d = f.create_dataset('2',data=np.ones((2,3)))
In [24]: d = f.create_dataset('3',data=np.arange(6.).reshape(2,3))
Your np.concatenate(list(dictionary.values))
code is missing ()
:
In [25]: f.values
Out[25]: <bound method MappingHDF5.values of <HDF5 file "test.hf" (mode r+)>>
In [26]: f.values()
Out[26]: ValuesViewHDF5(<HDF5 file "test.hf" (mode r+)>)
In [27]: list(f.values())
Out[27]:
[<HDF5 dataset "0": shape (2, 3), type "<f8">,
<HDF5 dataset "1": shape (2, 3), type "<f8">,
<HDF5 dataset "2": shape (2, 3), type "<f8">,
<HDF5 dataset "3": shape (2, 3), type "<f8">]
So it's just a list of the datasets. The downloading occurs when concatenate
does a np.asarray(a)
for each element of the list:
In [28]: np.concatenate(list(f.values()))
Out[28]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 1., 1.],
[1., 1., 1.],
[0., 1., 2.],
[3., 4., 5.]])
e.g.:
In [29]: [np.array(a) for a in f.values()]
Out[29]:
[array([[0., 0., 0.],
[0., 0., 0.]]), array([[0., 0., 0.],
[0., 0., 0.]]), array([[1., 1., 1.],
[1., 1., 1.]]), array([[0., 1., 2.],
[3., 4., 5.]])]
In [30]: [a[...] for a in f.values()]
....
Let's look at what happens when using your iteration approach:
Make an array that can takes one dataset for each 'row':
In [34]: samples = np.zeros((4,2,3),float)
In [35]: for i,d in enumerate(f.values()):
...: v = d[...]
...: print(v.__array_interface__['data']) # databuffer location
...: samples[i,...] = v
...:
(27845184, False)
(27815504, False)
(27845184, False)
(27815504, False)
In [36]: samples
Out[36]:
array([[[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.]],
[[1., 1., 1.],
[1., 1., 1.]],
[[0., 1., 2.],
[3., 4., 5.]]])
In this small example, it recycled every other databuffer block. The 2nd iteration frees up the databuffer used in the first, which can then be reused in the 3rd, and so on.
These are small arrays in a interactive ipython
session. I don't know if these observations apply in large cases.
Upvotes: 0
Reputation: 927
Your problem is that you need to have 2 copies of the same data in memory.
If you build the array as in test1
you'll need far less memory at once, but at the cost of losing the dictionary.
import numpy as np
import time
def test1(n):
a = {x:(x, x, x) for x in range(n)} # Build sample data
b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
return b
def test2(n):
a = {x:(x, x, x) for x in range(n)} # Build sample data
b = np.concatenate(list(a.values()))
return b
x1 = test1(1000000)
del x1
time.sleep(1)
x2 = test2(1000000)
Results:
test1 : 0.71 s
test2 : 1.39 s
The first peek is for test1, it's not exactly in place but it reduces the memory usage quite a bit.
Upvotes: 1