Reputation: 21
Below is my Python code:
import dask.dataframe as dd

# VALUEFY, traintypes1, and index are defined elsewhere in my script (not shown)
VALUE2015 = dd.read_csv('A/SKD - M2M by Salesman (value by uom) (NEWSALES)2015-2016.csv',
                        usecols=VALUEFY, dtype=traintypes1)
REPORT = VALUE2015.groupby(index).agg({'JAN': 'sum', 'FEB': 'sum', 'MAR': 'sum',
                                       'APR': 'sum', 'MAY': 'sum', 'JUN': 'sum',
                                       'JUL': 'sum', 'AUG': 'sum', 'SEP': 'sum',
                                       'OCT': 'sum', 'NOV': 'sum', 'DEC': 'sum'}).compute()
REPORT.to_csv('VALUE*.csv', header=True)
It takes about 6 minutes to produce a 100 MB CSV file.
Upvotes: 1
Views: 788
Reputation: 1446
The Dask documentation says that, "generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as Pandas groupby-aggregations." So unless you're using a Dask distributed client to manage workers, threads, etc., the benefit of using it over vanilla Pandas isn't always there.
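If you do want to try the distributed scheduler, a minimal sketch would be to start a local client before running your code; the worker and thread counts below are only illustrative, so tune them to your machine:

import dask.dataframe as dd
from dask.distributed import Client

# Starting a Client makes the distributed scheduler the default for
# subsequent .compute() calls in this process.
client = Client(n_workers=4, threads_per_worker=1)  # illustrative values
print(client)  # shows the dashboard link and worker/thread layout

# ... then run the dd.read_csv / groupby / .compute() code from the question ...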
Also, try to time each step in your code, because if the bulk of the 6 minutes is taken up by writing the CSV file to disk, then again Dask will be of no help (for a single output file).
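For example, a rough sketch that times the compute step and the CSV write separately (this reuses VALUEFY, traintypes1, and index from your question):

import time

import dask.dataframe as dd

t0 = time.perf_counter()
VALUE2015 = dd.read_csv('A/SKD - M2M by Salesman (value by uom) (NEWSALES)2015-2016.csv',
                        usecols=VALUEFY, dtype=traintypes1)
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
# Read + groupby + aggregation all happen here, triggered by .compute()
REPORT = VALUE2015.groupby(index).agg({m: 'sum' for m in months}).compute()
t1 = time.perf_counter()
print(f'read + groupby + compute: {t1 - t0:.1f} s')

# Writing the result (now a plain Pandas DataFrame) to disk
REPORT.to_csv('VALUE*.csv', header=True)
t2 = time.perf_counter()
print(f'to_csv: {t2 - t1:.1f} s')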
Here's a nice tutorial from Dask on adding distributed schedulers for your tasks.
Upvotes: 1