Reputation: 1658
This differs from Write multiple numpy arrays to file in that I need to be able to stream content, rather than writing it all at once.
I need to write multiple compressed numpy arrays in binary to a file. I cannot store all the arrays in memory before writing, so it is more like streaming numpy arrays to a file.
This currently works fine as text:
file = open("some file", "w")
while doing stuff:
    file.writelines(somearray + "\n")  # somearray is a new instance every loop
However, this does not work if I try to write the arrays as binary.
The arrays are created at 30 Hz and grow too big to keep in memory. They also cannot each be stored in their own single-array file, because that would be wasteful and create a huge mess.
So I would like one file per session instead of 10k files per session.
Upvotes: 7
Views: 4814
Reputation: 2981
One option might be to use pickle to save the arrays to a file opened in append-binary mode:
import numpy as np
import pickle

arrays = [np.arange(n**2).reshape((n, n)) for n in range(1, 11)]

with open('test.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

new_arrays = []
with open('test.file', 'rb') as f:
    while True:
        try:
            new_arrays.append(pickle.load(f))
        except EOFError:
            break

assert all((new_array == array).all() for new_array, array in zip(new_arrays, arrays))
This might not be the fastest, but it should be fast enough. It might seem like this would take up more space, but comparing these:
x = 300
y = 300
arrays = [np.random.randn(x, y) for x in range(30)]

with open('test2.file', 'ab') as f:
    for array in arrays:
        pickle.dump(array, f)

with open('test3.file', 'ab') as f:
    for array in arrays:
        f.write(array.tobytes())

with open('test4.file', 'ab') as f:
    for array in arrays:
        np.save(f, array)
You'll find the file sizes as 1,025 KB, 1,020 KB, and 1,022 KB respectively.
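For completeness, the np.save variant can be read back incrementally too: repeated np.load calls on the same open file return one array each until the data runs out. A minimal sketch (using an in-memory BytesIO in place of a real file opened in 'ab'/'rb' mode; the exact end-of-data exception varies by numpy version, so both are caught):

```python
import io

import numpy as np

arrays = [np.arange(n**2).reshape((n, n)) for n in range(1, 5)]

# Stream several arrays through one binary file-like object.
buf = io.BytesIO()
for array in arrays:
    np.save(buf, array)

# Read them back one at a time until the data runs out.
buf.seek(0)
loaded = []
while True:
    try:
        loaded.append(np.load(buf))
    except (EOFError, ValueError):  # end-of-data signal differs across numpy versions
        break

assert all((a == b).all() for a, b in zip(loaded, arrays))
```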
Upvotes: 3
Reputation: 114911
An NPZ file is just a zip archive, so you could save each array to a temporary NPY file, add that NPY file to the zip archive, and then delete the temporary file.
For example,
import os
import zipfile

import numpy as np

# File that will hold all the arrays.
filename = 'foo.npz'

with zipfile.ZipFile(filename, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(10):
        # `a` is the array to be written to the file in this iteration.
        a = np.random.randint(0, 10, size=20)

        # Name for the temporary file to which `a` is written. The root of this
        # filename is the name that will be assigned to the array in the npz file.
        # I've used 'arr_{}' (e.g. 'arr_0', 'arr_1', ...), similar to how `np.savez`
        # treats positional arguments.
        tmpfilename = "arr_{}.npy".format(i)

        # Save `a` to a npy file.
        np.save(tmpfilename, a)

        # Add the npy file to the zip archive.
        zf.write(tmpfilename)

        # Delete the npy file.
        os.remove(tmpfilename)
Here's an example where that script is run, and then the data is read back using np.load:
In [1]: !ls
add_array_to_zip.py
In [2]: run add_array_to_zip.py
In [3]: !ls
add_array_to_zip.py foo.npz
In [4]: foo = np.load('foo.npz')
In [5]: foo.files
Out[5]:
['arr_0',
'arr_1',
'arr_2',
'arr_3',
'arr_4',
'arr_5',
'arr_6',
'arr_7',
'arr_8',
'arr_9']
In [6]: foo['arr_0']
Out[6]: array([0, 9, 3, 7, 2, 2, 7, 2, 0, 5, 8, 1, 1, 0, 4, 2, 5, 1, 8, 2])
You'll have to test this on your system to see if it can keep up with your array generation process.
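As a side note, on Python 3.6+ the temporary file can be avoided entirely: zipfile.ZipFile.open accepts mode='w' and returns a writable file object, so np.save can write each array straight into the archive. A sketch of that variant (the 'foo2.npz' name is just for illustration):

```python
import zipfile

import numpy as np

arrays = [np.random.randint(0, 10, size=20) for _ in range(3)]

with zipfile.ZipFile('foo2.npz', mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    for i, a in enumerate(arrays):
        # Write the NPY data directly into the archive entry --
        # no temporary file on disk (requires Python 3.6+).
        with zf.open('arr_{}.npy'.format(i), mode='w') as f:
            np.save(f, a)

# The result is still a valid npz file.
loaded = np.load('foo2.npz')
assert all((loaded['arr_{}'.format(i)] == a).all()
           for i, a in enumerate(arrays))
```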
Another alternative is to use something like HDF5, with either h5py or pytables.
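To sketch the HDF5 route with h5py (assuming h5py is installed; the 'frames.h5' name and the 300x300 frame shape are only illustrative): a chunked dataset with an unlimited first axis can be resized and appended to as each 30 Hz frame arrives, with compression applied per chunk:

```python
import numpy as np
import h5py  # assumption: h5py is available

frame_shape = (300, 300)

with h5py.File('frames.h5', 'w') as f:
    dset = f.create_dataset(
        'frames',
        shape=(0,) + frame_shape,        # start empty
        maxshape=(None,) + frame_shape,  # unlimited along the first axis
        chunks=(1,) + frame_shape,       # one frame per chunk
        compression='gzip',
    )
    for _ in range(5):                   # stands in for the 30 Hz loop
        frame = np.random.randn(*frame_shape)
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = frame                 # append the new frame

with h5py.File('frames.h5', 'r') as f:
    final_shape = f['frames'].shape      # (5, 300, 300)
```

Only the current frame is ever in memory, and everything ends up in one file per session.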
Upvotes: 4