Raichan Abdikar

Reputation: 21

Why does my code take so long to write a CSV file with Dask in Python?

Below is my Python code:

import dask.dataframe as dd

# VALUEFY (the column list), traintypes1 (the dtype mapping) and index
# (the groupby key) are defined earlier in my script
VALUE2015 = dd.read_csv('A/SKD - M2M by Salesman (value by uom) (NEWSALES)2015-2016.csv',
                        usecols=VALUEFY, dtype=traintypes1)

REPORT = VALUE2015.groupby(index).agg(
    {'JAN': 'sum', 'FEB': 'sum', 'MAR': 'sum', 'APR': 'sum',
     'MAY': 'sum', 'JUN': 'sum', 'JUL': 'sum', 'AUG': 'sum',
     'SEP': 'sum', 'OCT': 'sum', 'NOV': 'sum', 'DEC': 'sum'}).compute()

REPORT.to_csv('VALUE*.csv', header=True)

It takes 6 minutes to create a 100MB CSV file.

Upvotes: 1

Views: 788

Answers (1)

gherka

Reputation: 1446

Looking through the Dask documentation, it says that, "generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as Pandas groupby-aggregations." So unless you're using a Dask distributed client to manage workers, threads, etc., you won't always see a benefit over vanilla Pandas.
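
For example, here's a minimal sketch of starting a local distributed client (the worker and thread counts here are just placeholders to tune for your machine):

from dask.distributed import Client

# start a local cluster; the client's dashboard link lets you
# watch where the time actually goes across workers
client = Client(n_workers=4, threads_per_worker=2)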

Also, try timing each step in your code: if the bulk of the 6 minutes is spent writing the CSV to disk, then again Dask will be of no help (for a single file).
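
As a rough sketch (reusing the variables from your question, with the aggregation dict shortened for brevity), you could time the aggregation and the write separately with time.perf_counter:

import time

start = time.perf_counter()
# note: dd.read_csv is lazy, so the read cost is also included here
REPORT = VALUE2015.groupby(index).agg({'JAN': 'sum', 'FEB': 'sum'}).compute()
print('groupby/compute took', time.perf_counter() - start, 'seconds')

start = time.perf_counter()
REPORT.to_csv('VALUE*.csv', header=True)
print('to_csv took', time.perf_counter() - start, 'seconds')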

Here's a nice tutorial from Dask on adding distributed schedulers for your tasks.

Upvotes: 1
