Bouncner
Bouncner

Reputation: 2369

Is it possible to manually create Dask data frames? (i.e., not by a fixed partition count)

I would like to define a way in which a data dataframe is created (e.g., a particular criteria for splitting) or be able to manually create one.

The situation: I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key. So I need to ensure that this key is not split over several partitions. Currently, I am splitting the input data frame (Pandas) manually and use multiprocessing to process each partition separately.

I would love to use Dask, which I also user for other computations, due to its ease of use. But I can't find a way to manually define how the input dataframe is split in order to later use map_partitions.

Or am I on a completely wrong path here and should other methods of Dask?

Upvotes: 0

Views: 142

Answers (1)

MRocklin
MRocklin

Reputation: 57281

You might find using dask delayed useful and then use that to create a custom dask dataframe? https://docs.dask.org/en/latest/dataframe-create.html#dask-delayed

Upvotes: 1

Related Questions