Dhruv Kumar

Reputation: 419

Confusion regarding cluster scheduler and single machine distributed scheduler

In the code below, why does dd.read_csv run on the cluster? I would expect only a call like client.read_csv to run on the cluster.

import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)
df.compute()

Is it the case that once I make a Client object, all API calls will run on the cluster?

Upvotes: 0

Views: 785

Answers (2)

mdurant

Reputation: 28673

In addition to the first answer:

Yes, creating a client connected to a distributed cluster makes that cluster the default scheduler for all subsequent Dask work. You can, however, specify where you would like work to run, as follows (a combined sketch appears at the end of this answer):

  • for a specific compute,

    df.compute(scheduler='threads')
    
  • for a block of code,

    with dask.config.set(scheduler='threads'):
        df.compute()
    
  • until further notice,

    dask.config.set(scheduler='threads')
    df.compute()
    

See http://dask.pydata.org/en/latest/scheduling.html
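Putting the three options together, here is a minimal runnable sketch, assuming the same hypothetical scheduler address and file.csv from the question:

    import dask
    import dask.dataframe as dd
    from dask.distributed import Client

    # Connecting a Client makes the cluster the default scheduler.
    client = Client('10.31.32.34:8786')

    df = dd.read_csv('file.csv', blocksize=1e8)

    # 1. Override for a single compute call.
    df.compute(scheduler='threads')

    # 2. Override for a block of code; the previous default
    #    is restored when the context manager exits.
    with dask.config.set(scheduler='threads'):
        df.compute()

    # 3. Override until further notice.
    dask.config.set(scheduler='threads')
    df.compute()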

Upvotes: 1

MRocklin

Reputation: 57261

The command dd.read_csv('file.csv', blocksize=1e8) will generate many pd.read_csv(...) tasks, each of which will run on your Dask workers. Each task will look for the file.csv file, seek to some location within that file defined by your blocksize, and read those bytes to create a pandas dataframe. The file.csv file therefore needs to be present for every worker.
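As a quick check, you can see how many of these per-block tasks a file will produce before computing anything (a small sketch, assuming file.csv exists where the code runs):

    import dask.dataframe as dd

    df = dd.read_csv('file.csv', blocksize=1e8)
    # One partition per ~100 MB block; each partition becomes
    # one pd.read_csv task on a worker at compute time.
    print(df.npartitions)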

It is common for people to use files that are on some universally available storage, like a network file system, database, or cloud object store.
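For example, Dask can read directly from cloud object storage so that every worker sees the same bytes (a sketch with a hypothetical bucket name; reading from S3 requires the s3fs package to be installed on all workers):

    import dask.dataframe as dd

    # 'my-bucket' is a placeholder; any path all workers can reach works.
    df = dd.read_csv('s3://my-bucket/file.csv', blocksize=1e8)
    df.compute()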

Upvotes: 2
