user15964

Reputation: 2639

extra keywords for dask.compute

In JupyterLab, ?dask.compute will show:

Signature: dask.compute(*args, **kwargs)
Docstring:
Compute several dask collections at once.

Parameters
----------
args : object
    Any number of objects. If it is a dask object, it's computed and the
    result is returned. By default, python builtin collections are also
    traversed to look for dask objects (for more information see the
    ``traverse`` keyword). Non-dask arguments are passed through unchanged.
traverse : bool, optional
    By default dask traverses builtin python collections looking for dask
    objects passed to ``compute``. For large collections this can be
    expensive. If none of the arguments contain any dask objects, set
    ``traverse=False`` to avoid doing this traversal.
scheduler : string, optional
    Which scheduler to use like "threads", "synchronous" or "processes".
    If not provided, the default is to check the global settings first,
    and then fall back to the collection defaults.
optimize_graph : bool, optional
    If True [default], the optimizations for each collection are applied
    before computation. Otherwise the graph is run as is. This can be
    useful for debugging.
kwargs
    Extra keywords to forward to the scheduler function.

It says kwargs are "Extra keywords to forward to the scheduler function." But how can I know which extra keywords can be used here?
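
For context, this is the kind of call I mean (a minimal sketch; the array is just for illustration):

import dask
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))

# The documented keywords are clear enough...
(result,) = dask.compute(x.sum(), scheduler="threads")

# ...but which extra keywords may go into **kwargs here?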

Upvotes: 2

Views: 368

Answers (2)

user15964

Reputation: 2639

I found a doc page, scheduler-overview, which is not in the TOC of the dask docs. It mentions four get functions and says:

  • dask.threaded.get: a scheduler backed by a thread pool
  • dask.multiprocessing.get: a scheduler backed by a process pool
  • dask.get: a synchronous scheduler, good for debugging
  • distributed.Client.get: a distributed scheduler for executing graphs on multiple machines. This lives in the external distributed project.

For more information on the individual options for each scheduler, see the docstrings for each scheduler get function.
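
To illustrate what a scheduler get function is, it can be called directly on a hand-built task graph (a minimal sketch using the synchronous dask.get):

import dask

# A tiny hand-built task graph: "b" depends on "a".
dsk = {"a": 1, "b": (lambda x: x + 1, "a")}

# The synchronous scheduler runs everything in the current thread,
# which is what makes it convenient for debugging.
print(dask.get(dsk, "b"))  # prints 2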

So, in JupyterLab, if we type ?dask.multiprocessing.get, we get:

Signature:
dask.multiprocessing.get(
    dsk,
    keys,
    num_workers=None,
    func_loads=None,
    func_dumps=None,
    optimize_graph=True,
    pool=None,
    chunksize=None,
    **kwargs,
)

So we can see that num_workers, chunksize, etc. can be passed to compute.
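
For example (a minimal sketch; the array is illustrative, while num_workers and chunksize come straight from the signature above):

import dask
import dask.array as da

if __name__ == "__main__":  # needed for the process-based scheduler on some platforms
    x = da.ones((1000, 1000), chunks=(100, 100))

    # scheduler="processes" routes the call to dask.multiprocessing.get,
    # so its keywords num_workers and chunksize can be forwarded here.
    (total,) = dask.compute(x.sum(), scheduler="processes",
                            num_workers=4, chunksize=1)
    print(total)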

Upvotes: 2

SultanOrazbayev

Reputation: 16561

As per the comment by @furas, for an exhaustive list of arguments you will need to examine the source code. The relevant documentation is the distributed API, especially the client.submit and client.compute entries.

However, in practice, the ones that I tend to use are (see the sketch after this list):

  • resources, specifying the resources needed for the task (e.g. resources={"foo": 1} to make each task use 1 unit of some resource "foo")
  • priority, specifying task priority (e.g. priority=-10 to make this task less important relative to others)
  • key, which has to be a unique name per task; I use it rarely, only to get a custom representation of tasks in the dashboard when debugging/monitoring long-running tasks.
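
A usage sketch of these keywords with client.submit (assuming a local cluster; the function and the key name are just illustrative):

from dask.distributed import Client

client = Client()  # a local cluster, for illustration only

# priority and key are accepted by client.submit; the key must be
# unique per task and shows up as the task's name in the dashboard.
future = client.submit(sum, [1, 2, 3], priority=-10, key="my-sum-task")
print(future.result())  # 6

# resources={"foo": 1} would also be accepted, but it only makes sense
# when the workers were started with matching resource declarations.

client.close()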

Upvotes: 2
