Reputation: 13
I have a large dataset stored as zipped npy files. How can I stack a given subset of these into a Dask array?
I'm aware of dask.array.from_npy_stack, but I don't know how to use it for this.
Here's a crude first attempt that uses up all my memory:
import numpy as np
import dask.array as da

# np.load on an .npz is lazy, but indexing the result below is not
data = np.load('data.npz')

def load(files):
    # data[file] eagerly decompresses the whole member into RAM,
    # so da.from_array just wraps arrays that are already in memory
    arrays = [da.from_array(data[file]) for file in files]
    return da.stack(arrays)

x = load(['foo', 'bar'])
Upvotes: 1
Views: 1108
Reputation: 57261
Well, you can't start by loading a large npz file into memory, because then you're already out of memory before Dask can help. I would read each array in a delayed fashion instead, build Dask arrays from those delayed reads (da.from_delayed rather than da.from_array), and then call da.stack much as you do in your example.
Here are some docs that may help if you haven't seen them before: https://docs.dask.org/en/latest/array-creation.html#using-dask-delayed
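For concreteness, here is a minimal sketch of that delayed approach. The load_member helper and the shape/dtype plumbing are my own illustrative assumptions: da.from_delayed has to be told each array's shape and dtype up front, since nothing is actually read until compute time. This sketch peeks at one member to get them, which assumes all members share the same shape and dtype.

import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_member(path, name):
    # Reopen the archive inside the task; indexing decompresses only this member
    with np.load(path) as data:
        return data[name]

def load(path, names, shape, dtype):
    # from_delayed needs shape/dtype up front, since nothing is read yet
    arrays = [
        da.from_delayed(load_member(path, name), shape=shape, dtype=dtype)
        for name in names
    ]
    return da.stack(arrays)

# Peek at one member for shape/dtype (loads just that one array, once)
with np.load('data.npz') as peek:
    template = peek['foo']

x = load('data.npz', ['foo', 'bar'], template.shape, template.dtype)

With this, each member is decompressed inside a Dask task only when its chunk is actually needed, so peak memory stays around one member at a time plus whatever your computation itself holds.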
Upvotes: 1