Mateusz Kraiński

Reputation: 95

Dask dataframe.to_csv storing files on worker instead of locally

I'm quite new to Dask; I'm trying to set up a distributed cluster on a private cloud. Right now I have the scheduler and one worker, both running in the same Docker container on the same machine. They're started with dask-scheduler and dask-worker tcp://localhost:8786 respectively.

I'm connecting from my local machine to the scheduler. For the sake of simplicity, let's say that I'm running an IPython console locally, in a directory /home/my_user/local_directory. I'm running:

from dask.distributed import Client
# connect to the remote scheduler
client = Client('scheduler_host:scheduler_port')

This works fine. I can do some operations, schedule work, .compute() on dataframes also works as expected.

I'm having an issue when saving results to a file. When following the example from here and running:

import dask
import os

# create the output directory (this runs on the client machine)
if not os.path.exists('data'):
    os.mkdir('data')

# example timeseries dataframe, one partition per day
df = dask.datasets.timeseries()
# write one CSV per partition; '*' is replaced by the partition index
df.to_csv('data/*.csv')

I'd expect the CSV files (one per partition) to be created in a local data directory, i.e. in /home/my_user/local_directory/data on my local machine. Instead, the files are saved on the scheduler/worker machine, in its /home/my_user/local_directory/data directory. The same happens when replacing the last line with df.to_csv('data/test.csv', single_file=True).

Something more interesting happens when replacing that line with df.to_parquet('test.parquet'). In the parquet case, an empty test.parquet directory is created on my local machine and the results are stored in /home/my_user/local_directory/test.parquet on the scheduler/worker. It will also raise errors if the directory is not accessible locally.

According to this, running to_parquet should save the files locally. But according to this, the files are created locally on the worker machine. If the second is true, why would the parquet directory be created locally? Why would the worker use my local path when storing the data?

Is this how it's supposed to work, or am I doing something wrong with the setup? Please advise! Thank you in advance!

Upvotes: 2

Views: 788

Answers (1)

MRocklin

Reputation: 57271

Dask dataframe storage functions save results from the workers. Typically people use Dask with a global file system, like NFS, HDFS, or a cloud object store.
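
For example, here is a minimal sketch of writing to a shared object store (the S3 bucket name and credentials are placeholders, and it assumes s3fs is installed on both the client and the workers):

import dask

df = dask.datasets.timeseries()
# every worker writes its partitions straight to the shared bucket
df.to_csv('s3://my-example-bucket/data/*.csv',
          storage_options={'key': '...', 'secret': '...'})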

If you want to store things locally then you should either use Dask on a single machine, or, if your results are small, call .compute() to bring the results back to your local machine as a pandas dataframe and then use the pandas storage functions.
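
A minimal sketch of that second option (the output path is illustrative):

# pull the full result back to the client as a pandas dataframe
local_df = df.compute()
# a plain pandas write, which happens on the local machine
local_df.to_csv('/home/my_user/local_directory/data/result.csv')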

Upvotes: 4
