ps0604

Reputation: 1071

Implementing Dask scheduler and workers on Docker containers

I need to run a scikit-learn RandomForestClassifier with multiple processes in parallel. For that, I'm looking into implementing a Dask scheduler with N workers, where the scheduler and each worker run in a separate Docker container.

The client application, which also runs in a separate Docker container, will first connect to the scheduler and then start the scikit-learn training inside a with joblib.parallel_backend('dask'): block.
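Roughly, the client code would look like the sketch below (the scheduler hostname, port, and parquet path are placeholders for whatever my Docker setup actually uses):

```python
# Rough sketch of the client side; "scheduler" and the parquet path are
# placeholder names, not the actual values used in my setup.
import joblib
import pandas as pd
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier

client = Client("tcp://scheduler:8786")        # connect to the scheduler container

df = pd.read_parquet("/data/train.parquet")    # data lives in the client container
X, y = df.drop(columns=["target"]), df["target"]

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
with joblib.parallel_backend("dask"):
    clf.fit(X, y)                              # fitting is dispatched to the Dask workers
```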

The data used to train the machine learning model is stored as Parquet files in the client application's Docker container. What is the best practice for giving the workers access to the data? Should the data be located somewhere else, such as a shared directory?

Upvotes: 2

Views: 1012

Answers (1)

jordanvrtanoski

Reputation: 5527

Since Apache Parquet is file-system based, it all depends on the architecture you are building, that is, whether your project will run on a single server or be distributed across multiple servers.

If you are running on a single server, then simply sharing a Docker volume between the containers, or even a common bind mount of the local file storage, will do the job.
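For example, a docker-compose sketch along these lines gives the scheduler, the workers, and the client the same view of the data (service names, images, and the /data mount point are assumptions, not something from your question):

```yaml
# Sketch only: service names, images, and the /data path are assumptions.
version: "3.8"
services:
  scheduler:
    image: daskdev/dask
    command: dask-scheduler

  worker:
    image: daskdev/dask
    command: dask-worker tcp://scheduler:8786
    volumes:
      - training-data:/data      # workers read the parquet files from here

  client:
    image: my-client-app         # hypothetical image for the client application
    volumes:
      - training-data:/data      # client writes the parquet files to the same volume

volumes:
  training-data:
```

The important part is that every container mounts the same volume at the same path, so the file paths the client passes to the workers are valid everywhere.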

If, on the other hand, you are trying to set up distributed training across multiple servers, then you will need some type of file server to handle the files.

One of the simplest ways to share files is through an NFS server, and one of the commonly used images for this is erichough/nfs-server. You will need to use this container to export the local folder(s) where the files are stored, and then mount the exported file system on the remaining servers.
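On the hosts that only consume the share, Docker's built-in local volume driver can mount the NFS export directly, so the containers themselves do not need to change; something along these lines (the server address and export path are placeholders):

```yaml
# Sketch of an NFS-backed named volume on a worker host; addr and device
# are placeholders for the actual NFS server and export path.
volumes:
  training-data:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=10.0.0.10,rw,nfsvers=4"
      device: ":/exports/data"
```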

Upvotes: 1
