Reputation: 91
When I load data from Parquet or CSV files, the resulting Dask DataFrame has divisions of None. The Dask docs have no information about how to set or calculate them.
How do I correctly set and calculate the divisions of a Dask DataFrame?
Upvotes: 4
Views: 4040
Reputation: 91
OK, I tried this:
divisions = list(range(f.npartitions))
f = f.set_index(f.index, divisions=divisions).persist()
Then I ran:
f.groupby('userId').first().compute()
But the last operation is dramatically slow!
Upvotes: 1
Reputation: 13437
If you read from Parquet, you can pass infer_divisions=True, as in this example:
import dask.dataframe as dd
df = dd.read_parquet("file.parq", infer_divisions=True)
If you need to, you can also set the index directly while reading:
df = dd.read_parquet("file.parq", index="my_col", infer_divisions=True)
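As for what the divisions actually are: Dask expects npartitions + 1 sorted index values, where each of the first npartitions entries is the lowest index value in its partition and the final entry is the highest index value in the last partition. Partition numbers such as range(f.npartitions) are therefore one entry short and are not index values. The following is only a plain-Python sketch of that rule, not Dask's own implementation; compute_divisions and the sample partitions are hypothetical names made up for illustration.

```python
from typing import List, Sequence


def compute_divisions(partitions: Sequence[Sequence[int]]) -> List[int]:
    """Return Dask-style divisions for partitions holding sorted index values.

    Hypothetical helper for illustration only; it is not part of Dask.
    """
    if not partitions or any(len(p) == 0 for p in partitions):
        raise ValueError("every partition must contain at least one index value")

    # Lower boundary of each partition: its first (smallest) index value.
    divisions = [p[0] for p in partitions]
    # Closing boundary: the last (largest) index value of the final partition.
    divisions.append(partitions[-1][-1])

    # Divisions only make sense when the index is sorted across partitions.
    if any(a > b for a, b in zip(divisions, divisions[1:])):
        raise ValueError("partitions are not globally sorted by index")
    return divisions


# Three partitions of an already sorted integer index (made-up sample data).
partitions = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
divisions = compute_divisions(partitions)
print(divisions)  # [0, 3, 5, 8]
```

In Dask itself you normally do not build this list by hand: set_index on a column computes the divisions for you, and set_index(..., sorted=True) can be used when the data is already sorted on that column. Recent Dask releases appear to name the read_parquet option calculate_divisions rather than infer_divisions, so check which one your installed version accepts.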
Upvotes: 1