SARose

Reputation: 3725

How to use Dask on Databricks

I want to use Dask on Databricks. It should be possible (I cannot see why not). When I import it, one of two things happens: either I get an ImportError, or, after I install distributed to fix that, Databricks just says Cancelled without throwing any errors.

Upvotes: 9

Views: 6743

Answers (2)

Jacob Tomlinson

Reputation: 3773

There is now a dask-databricks package from the Dask community which makes running Dask clusters alongside Spark/Photon on multi-node Databricks quick to set up. This way you can run one cluster and then use either framework on the same infrastructure.

You create an init script that installs dask-databricks and uses a Dask CLI command to start the Dask cluster components.

#!/bin/bash

# Install Dask + Dask Databricks
/databricks/python/bin/pip install --upgrade dask[complete] dask-databricks

# Start Dask cluster components
dask databricks run

Then in your Databricks Notebook you can get a Dask client object using the dask_databricks.get_client() utility.

import dask_databricks

client = dask_databricks.get_client()

It also sets up access to the Dask dashboard via the Databricks web proxy.

Upvotes: 11

mdurant

Reputation: 28673

I don't think we have heard of anyone using Dask on Databricks, but so long as it's just Python, it may well be possible.

Dask's default scheduler is the threaded one, and this is the most likely to work. In that case you don't even need to install distributed.
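For example, a threaded computation needs only dask (and numpy) installed; a minimal sketch, with an illustrative array computation:

```python
import dask.array as da

# Build a lazy 1000x1000 array of ones, split into 100x100 chunks
x = da.ones((1000, 1000), chunks=(100, 100))

# compute() with the threaded scheduler runs in-process threads,
# so no distributed package (and no extra processes) is needed
total = x.sum().compute(scheduler="threads")
print(total)  # 1000000.0
```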

As for the Cancelled error: it sounds like you are using distributed, and, at a guess, the system is not allowing you to start extra processes (you could test this with the subprocess module). To work around it, you could do

from dask.distributed import Client

client = Client(processes=False)

Of course, if it is indeed the processes that you need, this would not be great. Also, I have no idea how you might expose the dashboard's port.
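The subprocess check suggested above could look like this; a minimal sketch, where a failure or hang would suggest the environment is blocking new processes:

```python
import subprocess
import sys

# Try to spawn a child Python process; if this raises or returns
# a non-zero exit code, process creation is likely restricted
result = subprocess.run(
    [sys.executable, "-c", "print('spawned ok')"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # "spawned ok" if process creation is allowed
```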

Upvotes: 1
