Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when creating a .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:

    import os
    import pickle

    def to_npy_info(dirname, dtype, chunks, axis):
        with open(os.path.join(dirname, 'info'), 'wb') as f:
            pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
    from pathlib import Path

    import numpy as np
    import dask.array as da

    data_dir = Path('/home/tom/data/')
    for i in range(3):
        data = np.zeros((2, 2))
        np.save(data_dir.joinpath('{}.npy'.format(i)), data)

    data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
    ---------------------------------------------------------------------------
    IOError                                   Traceback (most recent call last)
    <ipython-input-94-54315c368240> in <module>()
          9     np.save(data_dir.joinpath('{}.npy'.format(i)), data)
         10
    ---> 11 data = da.from_npy_stack('/home/tom/data/')

    /home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
       3722         Read data in memory map mode
       3723     """
    -> 3724     with open(os.path.join(dirname, 'info'), 'rb') as f:
       3725         info = pickle.load(f)
       3726

    IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply use the lines of code that follow the loading of the "info" file, assuming you have the right values to pass in. Some of these values, i.e. dtype and the shape of each array for making chunks, could presumably be obtained by looking at the first of the data files:
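    import os
    from itertools import product
    import numpy as np
    from dask.array import Array

    # dirname, axis, mmap_mode, chunks and dtype are assumed to be defined already
    name = 'from-npy-stack-%s' % dirname
    keys = list(product([name], *[range(len(c)) for c in chunks]))
    values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
              for i in range(len(chunks[axis]))]
    dsk = dict(zip(keys, values))
    out = Array(dsk, name, chunks, dtype)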
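Another route is to write the missing "info" file yourself, so that from_npy_stack can be called unchanged. The following is only a sketch under some assumptions: the files are named 0.npy, 1.npy, ... with no gaps, they all share one dtype, and their shapes agree on every axis except the stacking axis. The helper name write_npy_info is hypothetical, and the layout of the info dictionary is taken from the snippet quoted in the question, so it may not match every Dask version.

```python
import os
import pickle

import numpy as np


def write_npy_info(dirname, axis=0):
    """Write the 'info' file that from_npy_stack looks for (hypothetical helper)."""
    # Collect the shape of each consecutively numbered file: 0.npy, 1.npy, ...
    shapes = []
    dtype = None
    while True:
        path = os.path.join(dirname, '%d.npy' % len(shapes))
        if not os.path.exists(path):
            break
        # mmap_mode='r' maps the file instead of reading the data into memory
        arr = np.load(path, mmap_mode='r')
        shapes.append(arr.shape)
        if dtype is None:
            # Assumption: every file has the same dtype as the first one
            dtype = arr.dtype
        del arr
    if not shapes:
        raise FileNotFoundError('no 0.npy found in %s' % dirname)

    # One chunk per file along the stacking axis, a single chunk elsewhere
    chunks = tuple(
        tuple(s[axis] for s in shapes) if dim == axis else (size,)
        for dim, size in enumerate(shapes[0])
    )

    info = {'chunks': chunks, 'dtype': dtype, 'axis': axis}
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump(info, f)
    return info
```

With the info file in place, the da.from_npy_stack('/home/tom/data') call from the question should be able to open the stack, provided the assumptions above hold for your files.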
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
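If you go the glob route, keep in mind that a plain sort puts 10.npy before 2.npy. A minimal sketch of listing the files in numeric order is shown here; the helper name list_npy_files is hypothetical, and it assumes the stacking order is encoded by the numbers in the file names.

```python
import glob
import os
import re


def list_npy_files(dirname):
    """Return the .npy files in dirname, ordered by the numbers in their names."""
    paths = glob.glob(os.path.join(dirname, '*.npy'))

    def natural_key(path):
        # Split 'frame_10.npy' into ['frame_', 10, '.npy'] so 10 sorts after 2
        name = os.path.basename(path)
        return [int(part) if part.isdigit() else part
                for part in re.split(r'(\d+)', name)]

    return sorted(paths, key=natural_key)
```

The returned paths could then replace the os.path.join(dirname, '%d.npy' % i) expressions when building the task values, with one entry in chunks[axis] per path.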