Reputation: 419
In the code below, why does dd.read_csv run on the cluster? I expected that something like client.read_csv would be the call that runs on the cluster.
import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)
df.compute()
Is it the case that once I make a Client object, all subsequent API calls will run on the cluster?
Upvotes: 0
Views: 785
Reputation: 28673
In addition to the first answer:
Yes, creating a client connected to a distributed scheduler makes that scheduler the default for all subsequent Dask work. You can, however, specify where you would like work to run, as follows:
for a specific compute,
df.compute(scheduler='threads')
for a block of code,
with dask.config.set(scheduler='threads'):
    df.compute()
until further notice,
dask.config.set(scheduler='threads')
df.compute()
See http://dask.pydata.org/en/latest/scheduling.html
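As a concrete illustration, here is a minimal sketch combining the three options above; the scheduler address and file name are placeholders carried over from the question, not a real deployment:
import dask
import dask.dataframe as dd
from dask.distributed import Client

# Connecting a Client makes the distributed scheduler the default
client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)

# 1. Override the scheduler for a single compute call
result = df.compute(scheduler='threads')

# 2. Override the scheduler for a block of code
with dask.config.set(scheduler='threads'):
    result = df.compute()

# 3. Override the scheduler globally until it is set again
dask.config.set(scheduler='threads')
result = df.compute()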
Upvotes: 1
Reputation: 57261
The command dd.read_csv('file.csv', blocksize=1e8)
will generate many pd.read_csv(...)
commands, each of which will run on your dask workers. Each task will open file.csv, seek to the location within that file defined by your blocksize, and read those bytes to create a pandas dataframe. The file.csv file must therefore be present and readable on every worker.
It is common for people to use files that are on some universally available storage, like a network file system, database, or cloud object store.
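For example, a hypothetical sketch of pointing the same code at cloud object storage (the bucket and key are made up, and reading from S3 requires the optional s3fs package installed on the workers):
import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')

# Every worker can reach the object store, so no local copy of the
# file is needed on any worker (bucket/key are placeholders).
df = dd.read_csv('s3://my-bucket/file.csv', blocksize=10e7)
result = df.compute()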
Upvotes: 2