Reputation: 21
I am new to Dask. I was using it with an xarray dataset. I persisted the dataset in memory, and the Jupyter cell showed that it was done (no more asterisk), but the Dask dashboard was still busy executing tasks, which I didn't understand. When this happens, should I wait until the Dask dashboard has stabilized, or am I free to run the next cell?
Upvotes: 0
Views: 330
Reputation: 15442
No, you do not need to wait. Isn't that great?
xarray and dask work together to schedule tasks, and dask's scheduler automatically tracks the dependency graph of all tasks.
If a local xarray operation requires the work to be completed, dask will block execution until all needed operations are complete. For example, plotting will immediately trigger computation and bring all the necessary data to the notebook so it can be rendered. Similarly, calling compute(), or executing any other operation which requires the data to be local, such as writing to a file or coercing to a native python or numpy data type (e.g. da.values or bool(da.isnull().any())), will trigger computation on the workers, and the call will block until the necessary tasks have completed. Conversely, distributed operations which only return a dask-backed xarray object will only do enough computing locally to schedule the operations on the cluster.
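As a rough sketch of the difference (the file name "data.nc", the chunking, and the variable name "temperature" are placeholders, not anything from your setup):
import xarray as xr

# chunks= gives dask-backed variables; the data itself is loaded lazily
ds = xr.open_dataset("data.nc", chunks={"time": 100})
da = ds["temperature"]

lazy_mean = da.mean("time")        # lazy: returns another dask-backed DataArray
result = lazy_mean.compute()       # blocks: pulls the computed values into memory
arr = da.values                    # blocks: coerces to a local numpy array
has_nan = bool(da.isnull().any())  # blocks: needs a concrete python bool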
Persist operations are a special case: the task graph is executed on the workers, but the result is not brought back to the local notebook, and the notebook may continue executing while the work is done. If you'd like to wait for the work to complete, you can always block with dask.distributed.wait(). But you don't necessarily need to - if you later execute a step which does need the data, that step will block until the computations are complete.
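For example, a minimal sketch assuming a dask.distributed Client is already connected and ds is a dask-backed xarray Dataset:
from dask.distributed import wait

ds = ds.persist()  # submits the graph to the workers and returns immediately
wait(ds)           # optional: block the notebook until the persisted tasks finish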
See the docs on using xarray with dask, Dask's Lazy Execution tutorial, and Dask: Best Practices for more info.
Upvotes: 1
Reputation: 759
Persist submits the task graph to the scheduler and returns a new collection backed by futures pointing to your results. So the calculation will be running in the background while you continue your work. You don't need to wait for the dashboard to finish.
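A short sketch of that behaviour, assuming a dask.distributed Client is already connected and ddf is any dask collection:
from dask.distributed import progress

ddf = ddf.persist()  # returns right away; the work keeps running on the cluster
progress(ddf)        # optional: show a progress bar for the background tasks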
Upvotes: 0
Reputation: 16561
If the subsequent code doesn't depend on the tasks being calculated, it's possible to continue running that code.
For example, let's say you submit a calculation with ddf.map_partitions(some_func) and persist the dataframe:
ddf = ddf.map_partitions(some_func)  # lazy: only builds the task graph
ddf = ddf.persist()                  # submits the graph to the scheduler and returns right away
Then, if you try to run some ddf-specific code, Jupyter will wait until ddf (or the relevant chunks/partitions) is computed by the scheduler, e.g. if you try to run len(ddf), it will block until the full ddf is computed.
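For instance, continuing the hypothetical ddf from above:
n_rows = len(ddf)  # blocks until every partition of ddf has finished computing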
If the subsequent code does not require submission to the scheduler, then it will run without waiting for ddf. For example, if downstream you want to create a list of some numbers, this will run right away:
list_evens = [i for i in range(1000) if i % 2 == 0]
Upvotes: 1