Reputation: 23
I'm trying to create aggregated statistics with dask arrays. map_blocks seems ideal, but I can't get it to work.
I'm new to dask, so I'm trying to understand how it works. I plan to use custom functions and started with some basics. I got stuck and can't see a solution after a few hours of trial and error.
import dask
import dask.array as da
from numpy import median, array

def func(a):
    m = median(a)
    print(m)
    return array(m)

x = da.random.random((10000, 10000), chunks=(5000, 5000))
x.map_blocks(func, chunks=(1, 1)).compute()
I would expect a new array with one result per block, but I get:
nan
0.5001597269075302
0.49996143572562185
0.49994227403711916
0.5001512434686584
Traceback (most recent call last):
...
result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
IndexError: tuple index out of range
Upvotes: 2
Views: 608
Reputation: 358
map_blocks can be slightly tricky at first. The problem here is that func returns an array of shape (), while in map_blocks you indicate output chunks of (1,1).
If I understand you correctly, you want to replace each chunk of x by its median (these would be new chunks of size (1,1)). To do so, you need to output an array with that shape. See the following code:
import dask
import dask.array as da
from numpy import median, array

def func(a):
    m = median(a)
    print(m)
    return array(m)[None, None]  # add dummy dimensions

# x = da.random.random((10000, 10000), chunks=(5000, 5000))
x = da.random.random((100, 100), chunks=(50, 50))  # try things out on a small array
x.map_blocks(func, chunks=(1, 1)).compute()
Indexing an array with None adds a dummy dimension to it. Therefore, array(m)[None,None] will have the desired shape (1,1).
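To see the effect of None indexing in isolation, here is a minimal NumPy-only sketch. The input values and the variable names (scalar_array, expanded) are made up for illustration and are not part of the code above.

```python
import numpy as np

# Illustrative input; any numeric array would do.
m = np.median(np.arange(10.0))

scalar_array = np.array(m)            # zero-dimensional array
expanded = scalar_array[None, None]   # two dummy axes added in front

print(scalar_array.shape)  # ()
print(expanded.shape)      # (1, 1)

# np.newaxis is an alias for None, so this is equivalent.
same = scalar_array[np.newaxis, np.newaxis]
print(same.shape)          # (1, 1)
```

The value itself is unchanged; only the number of dimensions differs, which is what lets the result match chunks of (1,1).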
Also, while experimenting until things work, it makes sense to use small data, as I did in the example above.
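On a small array it is also cheap to check the output shape before scaling up. The following is a sketch along those lines, not a definitive recipe: the name block_median and the explicit dtype argument are my own additions, and it assumes a dask version where map_blocks accepts dtype and chunks as keywords.

```python
import numpy as np
import dask.array as da


def block_median(block):
    # Median of one block, returned with shape (1, 1) to match chunks=(1, 1).
    return np.array(np.median(block))[None, None]


# Small array split into 2 x 2 blocks of 50 x 50 elements each.
x = da.random.random((100, 100), chunks=(50, 50))

# Passing dtype explicitly is an assumption made here for clarity;
# it tells dask the output type up front.
medians = x.map_blocks(block_median, chunks=(1, 1), dtype=x.dtype)

print(medians.shape)   # shape is known before computing: one entry per block
result = medians.compute()
print(result.shape)
```

With 2 x 2 input blocks, both printed shapes should be (2, 2), one median per block.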
Upvotes: 3