Riley Hun
Riley Hun

Reputation: 2775

How to split dask dataframe into partitions based on unique values in a column?

I have a dask dataframe with a date column doc_date that is in the range 12-1-2021 to 1-2-2022. I want to repartition and split this dask dataframe into the 26 partitions such that each partition has only 1 of the dates in the aforementioned date range.

Here's what I tried:

doc_dates = [dt.strftime("%Y-%m-%d") for dt in pd.date_range('2021-12-08', '2022-01-02')]
predictions_df = predictions_df.set_index('doc_date')
predictions_df = predictions_df.repartition(divisions=sorted(doc_dates))

But I appear to be getting this error:

ValueError: left side of old and new divisions are different

Upvotes: 2

Views: 2064

Answers (1)

Michael Delgado
Michael Delgado

Reputation: 15432

The issue is that you need to pass compute=True to dask.dataframe.set_index to ensure that the data is actually sorted by date before you can provide the sorted list of dates to the repartition command:

predictions_df = predictions_df.set_index('doc_date', compute=True)
predictions_df = predictions_df.repartition(divisions=sorted(doc_dates))

Alternatively, you can use the divisions argument to dask.dataframe.set_index:

predictions_df = predictions_df.set_index(
    'doc_date',
    divisions=sorted(doc_dates),
    compute=True,
)

Upvotes: 2

Related Questions