Reputation: 2840
I am reading in 64 compressed csv files (probably 70-80 GB) into one dask data frame then run groupby with aggregations.
The job never completed because appereantly the groupby creates a data frame with only one partition.
This post and this post already addressed this issue but focusing on the computational graph and not the memory issue you run into, when your resulting data frame is too large.
I tried a workaround with repartioning but the job still wont complete.
What am I doing wrong, will I have to use map_partition? This is very confusing as I expect Dask will take care of partitioning everything even after aggregation operations.
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB',diagnostics_port=5000)
client
dask.config.set(scheduler='processes')
dB3 = dd.read_csv("boden/expansion*.csv", # read in parallel
blocksize=None, # 64 files
sep=',',
compression='gzip'
)
aggs = {
'boden': ['count','min']
}
dBSelect=dB3.groupby(['lng','lat']).agg(aggs).repartition(npartitions=64)
dBSelect=dBSelect.reset_index()
dBSelect.columns=['lng','lat','bodenCount','boden']
dBSelect=dBSelect.drop('bodenCount',axis=1)
with ProgressBar(dt=30): dBSelect.compute().to_parquet('boden/final/boden_final.parq',compression=None)
Upvotes: 2
Views: 735
Reputation: 57271
Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out=
parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out=
will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.
Upvotes: 3