Reputation: 2369
I would like to define a way in which a data dataframe is created (e.g., a particular criteria for splitting) or be able to manually create one.
The situation:
I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key. So I need to ensure that this key is not split over several partitions.
Currently, I am splitting the input data frame (Pandas) manually and use multiprocessing
to process each partition separately.
I would love to use Dask, which I also user for other computations, due to its ease of use. But I can't find a way to manually define how the input dataframe is split in order to later use map_partitions
.
Or am I on a completely wrong path here and should other methods of Dask?
Upvotes: 0
Views: 142
Reputation: 57281
You might find using dask delayed useful and then use that to create a custom dask dataframe? https://docs.dask.org/en/latest/dataframe-create.html#dask-delayed
Upvotes: 1