Reputation: 91

How to set up (calculate) divisions in a Dask dataframe?

When I load data from Parquet or CSV files, the resulting dataframe has divisions of None. The Dask docs have no information about how to set or calculate them.

How do I correctly set up and calculate the divisions of a Dask dataframe?

Upvotes: 4

Answers (2)

Reputation: 91

OK, I do:

divisions = [part_n for part_n in range(f.npartitions)]
f = f.set_index(f.index, divisions=divisions).persist()

Then I do:

f.groupby('userId').first().compute()

But the last operation is dramatically slow!
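
A note on the divisions list above: as far as I understand it, Dask expects divisions to be npartitions + 1 sorted index values (the minimum index of each partition, plus the maximum index of the last one), not partition numbers. Here is a minimal plain-Python sketch of that rule. It assumes every partition's index is already sorted and that partitions do not overlap; compute_divisions is a hypothetical helper written for illustration, not a Dask function.

```python
def compute_divisions(partition_indexes):
    """Return npartitions + 1 boundary values for sorted, non-overlapping partitions."""
    if not partition_indexes:
        raise ValueError("need at least one partition")
    if any(len(idx) == 0 for idx in partition_indexes):
        raise ValueError("empty partitions have no boundary values")
    # First index value of every partition ...
    divisions = [idx[0] for idx in partition_indexes]
    # ... plus the last index value of the final partition.
    divisions.append(partition_indexes[-1][-1])
    # Boundaries must be non-decreasing, otherwise the index is not globally sorted.
    if any(a > b for a, b in zip(divisions, divisions[1:])):
        raise ValueError("partitions are not sorted by index")
    return tuple(divisions)


# Three partitions of an index that is sorted across the whole dataframe
# (made-up values, for illustration only).
partitions = [[0, 1, 2], [3, 4], [5, 8, 9]]
divisions = compute_divisions(partitions)
print(divisions)  # (0, 3, 5, 9)
```

In Dask itself, calling set_index on a column without passing divisions should let Dask work out these boundaries for you, and f.known_divisions reports whether they are set.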

Upvotes: 1

Reputation: 13437

If you read from Parquet, you can use infer_divisions=True, as in this example:

import dask.dataframe as dd
df = dd.read_parquet("file.parq", infer_divisions=True)

If you need to, you can also set an index directly while reading:

df = dd.read_parquet("file.parq", index="my_col",
                     infer_divisions=True)

Upvotes: 1