Tom Hemmes

Reputation: 2060

Dask array from_npy_stack misses info file

Action

I am trying to create a Dask array from a stack of .npy files that were not written by Dask.

Problem

Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing an .npy stack with Dask.

Attempts

I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:

import os
import pickle

def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
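
Given the right metadata, the same info file could presumably be written by hand. A sketch of my own (not from the question), assuming the three 2x2 float64 arrays created in the example below, concatenated along axis 0:

import os
import pickle

import numpy as np
import dask.array as da

dirname = '/home/tom/data'

# Metadata for three 2x2 float64 pieces concatenated along axis 0.
chunks = ((2, 2, 2), (2,))
with open(os.path.join(dirname, 'info'), 'wb') as f:
    pickle.dump({'chunks': chunks, 'dtype': np.dtype('float64'), 'axis': 0}, f)

data = da.from_npy_stack(dirname)  # now finds the info file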

Question

How do I go about loading .npy stacks that are created outside of Dask?

Example

from pathlib import Path
import numpy as np
import dask.array as da

data_dir = Path('/home/tom/data/')

for i in range(3):
    data = np.zeros((2,2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)

data = da.from_npy_stack('/home/tom/data')

Resulting in the following error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
      9     np.save(data_dir.joinpath('{}.npy'.format(i)), data)
     10 
---> 11 data = da.from_npy_stack('/home/tom/data/')

/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
   3722         Read data in memory map mode
   3723     """
-> 3724     with open(os.path.join(dirname, 'info'), 'rb') as f:
   3725         info = pickle.load(f)
   3726 

IOError: [Errno 2] No such file or directory: '/home/tom/data/info'

Upvotes: 2

Views: 797

Answers (1)

mdurant

Reputation: 28673

The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but in the meantime you can simply reuse the lines of code that follow the loading of the "info" file, provided you supply the right values yourself. Some of those values, namely the dtype and the shape of each array (from which the chunks are built), could presumably be obtained by looking at the first of the data files.

# dirname, mmap_mode, chunks, dtype and axis are the values that from_npy_stack
# would normally take from its arguments and from the pickled "info" file.
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))

out = Array(dsk, name, chunks, dtype)

Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
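
For reference, here is a self-contained sketch of that approach (my own, not part of the original answer; the helper name from_plain_npy_stack is invented). It infers the dtype and per-file shape from the first file, globs the filenames, and assumes every file has the same shape and that the pieces are concatenated along axis 0:

import os
from glob import glob
from itertools import product

import numpy as np
from dask.array.core import Array

def from_plain_npy_stack(dirname, mmap_mode=None):
    # Collect the .npy files; note this is a lexicographic sort, so pad the
    # numbers (00.npy, 01.npy, ...) or sort numerically if there are many files.
    filenames = sorted(glob(os.path.join(dirname, '*.npy')))

    # Infer dtype and per-file shape from the first file.
    first = np.load(filenames[0], mmap_mode='r')

    # One chunk per file along axis 0, a single chunk along every other axis.
    chunks = ((first.shape[0],) * len(filenames),) + tuple(
        (s,) for s in first.shape[1:])

    name = 'from-npy-stack-%s' % dirname
    keys = list(product([name], *[range(len(c)) for c in chunks]))
    values = [(np.load, fn, mmap_mode) for fn in filenames]
    dsk = dict(zip(keys, values))
    return Array(dsk, name, chunks, first.dtype)

With the three 2x2 files from the question this gives a 6x2 dask array, the pieces being concatenated along axis 0, just as from_npy_stack would do.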

Upvotes: 2
