Reputation: 9335
I created a Dask dataframe from a Pandas dataframe that is ~50K rows and 5 columns:
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=32)
I then add a bunch of columns (~30) to the dataframe and try to turn it back into a Pandas dataframe:
import dask.multiprocessing

DATA = ddf.compute(get=dask.multiprocessing.get)
I looked at the docs, and if I don't specify num_workers, it defaults to using all of my cores. I'm on a 64-core EC2 instance, and the line above has already been running for several minutes without finishing...
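For reference, a minimal sketch of capping the worker count explicitly through the same legacy get= API (the value of 8 workers is a hypothetical choice, not something from the docs):

import dask.multiprocessing

# num_workers is forwarded to the multiprocessing scheduler,
# overriding the default of one worker per core
DATA = ddf.compute(get=dask.multiprocessing.get, num_workers=8)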
Any idea how to speed this up, or what I'm doing incorrectly?
Thanks!
Upvotes: 4
Views: 757
Reputation: 395
I'd suggest trying to lower the number of threads and increase the number of processes to speed things up.
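A minimal sketch of what that could look like, assuming a recent Dask release where the scheduler= keyword has replaced get= (the worker count of 4 and the stand-in dataframe are hypothetical values to adapt):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"x": range(50_000)})   # stand-in for the original dataframe
ddf = dd.from_pandas(df, npartitions=32)

# process-based scheduler with an explicit worker count;
# scheduler="threads" would select the single-process threaded scheduler instead
DATA = ddf.compute(scheduler="processes", num_workers=4)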
Upvotes: 2