Is it possible to manually create Dask data frames? (i.e., not by a fixed partition count)

Question

I would like to define how a Dask dataframe is partitioned (e.g., by a particular splitting criterion), or be able to create the partitions manually.

The situation: I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key, so I need to ensure that the rows for a given key are not split across several partitions. Currently, I split the input data frame (Pandas) manually and use multiprocessing to process each partition separately.

I would love to use Dask, which I also use for other computations, because of its ease of use. But I can't find a way to manually define how the input dataframe is split so that I can later use map_partitions.

Or am I on completely the wrong path here, and should I use other Dask methods instead?


Answers (1)
