produce vector output from a dask array

Question

I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.

function to apply to each chunk:

def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
                                labeled_blocks.astype('int32'), transform=trans))[:-1]
    # Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
    return polys

line of code trying to get dask to do this:

test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()

where labeled_arr is the input chunked dask array.

Running as is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype returns an AttributeError since the output list type does not have a dtype attribute. I discovered the meta keyword, but still have been unable to get the right syntax to turn my output into a Series or list.

I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.

SultanOrazbayev · Accepted Answer

This might work:

import dask
import dask.array as da

# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))

@dask.delayed
def my_func(block):
    # do something fancy
    return list(block)

results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])

As you noted, the problem is that list has no dtype. A way around this would be to convert the list into a np.array, but I'm not sure if this will work with all geometry objects (it should be OK for Points, but polygons might be problematic due to varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects feeding them into your function one at a time (but scaled across workers/processes).

produce vector output from a dask array

Answers (2)

Related Questions