map_blocks with change of dimensions returns IndexError: tuple index out of range

Question

I'm trying to create aggregated statistics with dask arrays. map_blocks seems ideal but can't get it to work.

I'm new to dask so trying to understand the way it works. I plan to use custom functions and started with some basics. I've got stuck and can't see a solution after a few hours of trial & error.

import dask
import dask.array as da
from numpy import median,array

def func(a):
    m = median(a)
    print(m)
    return array(m)

x = da.random.random((10000, 10000), chunks=(5000, 5000))

x.map_blocks(func,chunks=(1,1)).compute()

I would expect a new array with the results per block, but get:

nan
0.5001597269075302
0.49996143572562185
0.49994227403711916
0.5001512434686584
Traceback (most recent call last):
  ...
    result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
IndexError: tuple index out of range

malbert · Accepted Answer

map_blocks can be slightly tricky at first. The problem here is that func returns an array of shape (), while in map_blocks you indicate output chunks of (1,1).

If I understand you correctly, you want to replace each chunk of x by its median (these would be new chunks of size (1,1)). To do so, you need to output an array with that shape. See the following code:

import dask
import dask.array as da
from numpy import median,array

def func(a):
    m = median(a)
    print(m)
    return array(m)[None,None] # add dummy dimensions

# x = da.random.random((10000, 10000), chunks=(5000, 5000))
x = da.random.random((100, 100), chunks=(50, 50)) # try things out on small array

x.map_blocks(func,chunks=(1,1)).compute()

Indexing an array with None adds a dummy dimension to it. Therefore, array(m)[None,None] will have the desired shape (1,1).

Also, for playing with these things until they work out it makes sense to work on small data, which I added in the above example.

map_blocks with change of dimensions returns IndexError: tuple index out of range

Answers (1)

Related Questions