Reputation: 768
I have a dask dataframe created from delayed functions, composed of randomly sized partitions. I would like to repartition the dataframe into chunks of (approximately) 10000 rows each.
I can calculate the correct number of partitions with np.ceil(df.size / 10000), but that seems to immediately compute the result.
IIUC, to compute the result it would have to read all of the dataframes into memory, which would be very inefficient. I would instead like to specify the whole operation as a dask graph to be submitted to the distributed scheduler, so that no calculations are done locally.
Is there some way to specify npartitions without having it immediately compute all the underlying delayed functions?
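A minimal sketch of the setup (the load_chunk function, its column name, and the size range are hypothetical, just to make the example runnable):

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

@dask.delayed
def load_chunk(i):
    # hypothetical loader; each call returns a randomly sized pandas DataFrame
    n = np.random.randint(1_000, 50_000)
    return pd.DataFrame({"x": np.arange(n)})

meta = pd.DataFrame({"x": pd.Series(dtype="int64")})
df = dd.from_delayed([load_chunk(i) for i in range(20)], meta=meta)

# This is the problem: deriving the integer that repartition() needs
# forces every delayed function to run.
total_size = df.size.compute()
npartitions = int(np.ceil(total_size / 10000))
df = df.repartition(npartitions=npartitions)
```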
Upvotes: 4
Views: 457
Reputation: 57281
The short answer is probably "no, there is no way to do this without looking at the data". The reason is that the structure of the graph depends on the values of your lazy partitions. For example, the graph will have a different number of nodes depending on your total data size.
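A hedged sketch of the workaround this implies (not stated in the answer itself, just one way to limit the cost): you still have to execute the delayed functions once to learn the sizes, but you can pull back only the per-partition row counts rather than the full data, and then repartition with the resulting integer. This assumes df is the dataframe built from delayed functions in the question.

```python
import numpy as np

# Still executes every delayed function once, because the row counts
# are only known after the partitions exist, but only small integers
# (not the partitions themselves) come back to the local process.
rows_per_partition = df.map_partitions(len).compute()
npartitions = int(np.ceil(rows_per_partition.sum() / 10000))
df = df.repartition(npartitions=npartitions)
```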
Upvotes: 4