ambigus9

Reputation: 1569

How to see progress of Dask compute task?

I would like to see a progress bar in a Jupyter notebook while running a compute task with Dask. I'm counting all values of the `id` column from a large CSV file (4+ GB), so any ideas?

import dask.dataframe as dd

df = dd.read_csv('data/train.csv')
df.id.count().compute()

Upvotes: 36

Views: 29573

Answers (3)

BND

Reputation: 678

The snippet below will show the remaining time and item counts:

from tqdm.dask import TqdmCallback

with TqdmCallback(desc="compute"):
    ...
    arr.compute()

# or use callback globally
cb = TqdmCallback(desc="global")
cb.register()
arr.compute()

https://github.com/tqdm/tqdm#dask-integration


Upvotes: 1

avriiil

Reputation: 382

This resource provides full-code examples for both cases (local and distributed) and more detailed information about using the Dask Dashboard.

Note that when working in Jupyter notebooks, you may have to put the `ProgressBar().register()` call and the computation you want to track (e.g. `df.set_index('id').persist()`) into two separate cells for the progress bar to actually appear.

DO:

[screenshot: `ProgressBar().register()` and the computation in two separate cells]

DON'T DO:

[screenshot: registration and computation in the same cell]

Upvotes: 0

MRocklin

Reputation: 57281

If you're using the single machine scheduler then do this:

from dask.diagnostics import ProgressBar
ProgressBar().register()

http://dask.pydata.org/en/latest/diagnostics-local.html

If you're using the distributed scheduler then do this:

from dask.distributed import progress

result = df.id.count().persist()
progress(result)

Or just use the dashboard

http://dask.pydata.org/en/latest/diagnostics-distributed.html

Upvotes: 48
