Reputation: 2775
I have a dask dataframe with a date column doc_date
that is in the range 12-1-2021
to 1-2-2022
. I want to repartition and split this dask dataframe into the 26 partitions such that each partition has only 1 of the dates in the aforementioned date range.
Here's what I tried:
doc_dates = [dt.strftime("%Y-%m-%d") for dt in pd.date_range('2021-12-08', '2022-01-02')]
predictions_df = predictions_df.set_index('doc_date')
predictions_df = predictions_df.repartition(divisions=sorted(doc_dates))
But I appear to be getting this error:
ValueError: left side of old and new divisions are different
Upvotes: 2
Views: 2064
Reputation: 15432
The issue is that you need to pass compute=True
to dask.dataframe.set_index
to ensure that the data is actually sorted by date before you can provide the sorted list of dates to the repartition
command:
predictions_df = predictions_df.set_index('doc_date', compute=True)
predictions_df = predictions_df.repartition(divisions=sorted(doc_dates))
Alternatively, you can use the divisions
argument to dask.dataframe.set_index
:
predictions_df = predictions_df.set_index(
'doc_date',
divisions=sorted(doc_dates),
compute=True,
)
Upvotes: 2