Mrityunjay

Reputation: 2261

Read with dask.dataframe when file not accessible from local machine

I have one powerful machine (a remote machine), accessible through SSH. My data is stored on the remote machine.

I want to run computations on and access data on the remote machine. For this, I ran a dask-scheduler and a dask-worker on the remote machine. Then I ran a jupyter notebook on my laptop (the local machine) with client = Client('scheduler-ip:8786'), but it still refers to data on the local machine, not on the remote machine.

How do I refer to data on the remote machine from a notebook running on the local machine?

import dask.dataframe as dd
from dask.distributed import Client

client = Client('remote-ip:8786')

ddf = dd.read_csv(
    'remote-machine-file.csv',
    header=None,
    assume_missing=True,
    dtype=object,
)

It fails with

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-37-17d26dadb3a8> in <module>
----> 1 ddf = dd.read_csv('remote-machine-file.csv', header=None, assume_missing=True, dtype=object)

/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    735             storage_options=storage_options,
    736             include_path_column=include_path_column,
--> 737             **kwargs,
    738         )
    739 

/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    520 
    521         # Infer compression from first path
--> 522         compression = infer_compression(paths[0])
    523 
    524     if blocksize == "default":

IndexError: list index out of range

Upvotes: 2

Views: 357

Answers (1)

Michael Delgado

Reputation: 15432

When using dask.dataframe with a distributed.Client, the majority of the I/O is done by the remote workers, but dask still relies on the client machine being able to access the data for scheduling (e.g. to resolve file paths and infer metadata, which is why the glob in read_csv comes back empty and raises the IndexError above).

To run anything purely on the worker, you can always have a worker run the operation itself, e.g. with:

import dask.dataframe as dd
from dask.distributed import Client

# connect to the remote scheduler (address from the question)
client = Client('remote-ip:8786')

# path as seen from the remote machine
fp = 'remote-machine-file.csv'

# use the client to have a worker run the dask.dataframe command!
f = client.submit(dd.read_csv, fp)

# because the worker is holding a dask dataframe object, requesting
# the result brings the dask.dataframe object/metadata to the
# local client, while leaving the data on the remote machine
df = f.result()
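
Once the handle is back on the client it behaves like any other dask dataframe, and computations are still executed on the workers that can see the file. A minimal sketch of using it, assuming the setup above:

# the task graph was built on the worker, so the file path inside it is
# resolved on the remote machine when the tasks actually run
print(df.head())   # reads a small sample on a worker, returns a pandas frame
print(len(df))     # total row count, computed on the workers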

Alternatively, you can partition the job manually: if you have many files, have the workers read them into memory with pandas, and then construct the dask dataframe locally with dask.dataframe.from_delayed:

import pandas as pd
import dask.dataframe as dd

files_on_remote = ['data/file_{}.csv'.format(i) for i in range(100)]

# have the workers read the data with pandas
futures = client.map(pd.read_csv, files_on_remote)

# use dask.dataframe.from_delayed to construct a dask.dataframe from the
# remote pandas objects
df = dd.from_delayed(futures)
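
If the schema matters, dd.from_delayed can also be given an explicit meta, which saves dask from materialising the first partition just to infer column names and dtypes. A sketch under the assumption that each file has two columns, a and b (hypothetical names):

import pandas as pd
import dask.dataframe as dd

# hypothetical schema -- replace the column names/dtypes with the real ones
meta = pd.DataFrame({'a': pd.Series(dtype='float64'),
                     'b': pd.Series(dtype='object')})

df = dd.from_delayed(futures, meta=meta)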

Upvotes: 2
