Icarus

Reputation: 1463

Connect local JupyterLab to Azure Databricks Spark cluster

I have an Azure Databricks cluster. Although it provides notebooks, my team is more familiar with JupyterLab, where they can upload offline CSV files and install Python packages. I want to set up a JupyterLab instance that can connect to the Spark cluster.

Although Databricks allows access via a remote kernel (https://databricks.com/blog/2019/12/03/jupyterlab-databricks-integration-bridge-local-and-remote-workflows.html), that approach can't read local files on the JupyterLab side.

Is there any way to use the Spark cluster from a local JupyterLab, similar to https://medium.com/ibm-data-ai/connect-to-remote-kerberized-hive-from-a-local-jupyter-notebook-to-run-sql-queries-83d5e548d82c? Many thanks.

Upvotes: 0

Views: 682

Answers (1)

Ecstasy

Reputation: 1864

Prefixing a magic command with %% makes it a cell magic, which takes the rest of the cell as its argument. In sparkmagic (the mechanism behind the HDInsight guides linked below), %%local runs the cell in the local Python instance instead of on the cluster, which is the basis for sending data between a local instance and a Spark cluster.
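A minimal sketch of that pattern, assuming a sparkmagic kernel as in the "Sending data to Spark cluster from Local instance" post linked below (the file name and variable names are hypothetical, and each snippet is a separate notebook cell):

# Cell 1 (runs in the local Python instance; %%local must be the cell's first line)
%%local
import pandas as pd
local_df = pd.read_csv("offline_data.csv")  # a file only the local machine can see

# Cell 2 (ships the local pandas DataFrame to the cluster: -i local variable, -t its type, -n remote name)
%%send_to_spark -i local_df -t df -n remote_df

# Cell 3 (runs on the cluster, where remote_df is now a Spark DataFrame)
remote_df.count()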

Alternatively, for Databricks specifically, install databrickslabs_jupyterlab locally:

(base)$ conda create -n dj python=3.8  # you might need to add "pywin32" if you are on Windows
(base)$ conda activate dj
(dj)$   pip install --upgrade databrickslabs-jupyterlab[cli]==2.2.1
(dj)$   dj $PROFILE -k                 # create a remote kernel spec for the cluster in profile $PROFILE
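Here $PROFILE is the name of a Databricks CLI connection profile in ~/.databrickscfg. If you don't have one yet, a minimal sketch for creating it (assuming the databricks-cli package, which may already be present as a dependency; the prompts ask for your workspace URL and a personal access token):

(dj)$   pip install databricks-cli
(dj)$   databricks configure --token --profile $PROFILE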

Start JupyterLab:

(dj)$   dj $PROFILE -l

Test the Spark access:

import socket

from databrickslabs_jupyterlab import is_remote

# `sc` is the cluster's SparkContext, provided by the remote kernel
result = sc.range(10000).repartition(100).map(lambda x: x).sum()

# Should print the remote driver's hostname, True, and 49995000
print(socket.gethostname(), is_remote())
print(result)
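Once the test passes, the cluster can be used like any other Spark backend. A minimal sketch, assuming the remote kernel also exposes the spark SparkSession as Databricks notebooks do:

# Runs on the Databricks cluster via the remote kernel
df = spark.range(1000).toDF("n")         # small DataFrame built cluster-side
df.selectExpr("sum(n) AS total").show()  # aggregation happens on the cluster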

For more details, see Install Jupyter Notebook on your computer and connect to Apache Spark on HDInsight, Kernels for Jupyter Notebook on Apache Spark clusters in Azure HDInsight, and Sending data to Spark cluster from Local instance.

Upvotes: 0
