Maxime

Reputation: 61

Import notebooks in Databricks

I am using databricks-connect and VS Code to develop some Python code for Databricks.

I would like to code and run/test everything directly from VS Code using databricks-connect, to avoid dealing with the Databricks web IDE. For basic notebooks it works just fine, but I would like to do the same with multiple notebooks and use imports (e.g. use import config-notebook in another notebook).

However, import another-notebook works fine in VS Code but it does not work in Databricks. From what I could find, the alternative in Databricks is %run "another-notebook", but that does not work if I want to run it from VS Code (databricks-connect does not include notebook workflows).

Is there any way to import notebooks that works both in Databricks and with databricks-connect?

Thanks a lot for your answers!

Upvotes: 3

Views: 6572

Answers (3)

chá de boldo

Reputation: 56

Well, you can package your code as a .whl (wheel) and install it on the cluster; then calling it via import in any notebook is a breeze.
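
For example, a minimal sketch (the package name my_shared_lib and everything in it are made up for illustration): a small setup.py like the one below, built into a wheel with python setup.py bdist_wheel, the resulting .whl installed on the cluster as a library, after which import my_shared_lib works from any notebook.

# setup.py for a hypothetical shared library to be built as a wheel
from setuptools import find_packages, setup

setup(
    name="my_shared_lib",      # hypothetical package name
    version="0.1.0",
    packages=find_packages(),  # picks up the my_shared_lib/ package and its __init__.py
)

The same wheel can be pip-installed in the local VS Code environment, so the import statement is identical in both places.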

Upvotes: 0

Kashyap

Reputation: 17401

As described in How to import one databricks notebook into another?

The only way to import notebooks is by using the %run command:

%run /Shared/MyNotebook

or relative path:

%run ./MyNotebook

More details: https://docs.azuredatabricks.net/user-guide/notebooks/notebook-workflows.html

The only way I can think of is to write conditional code that either uses import or %run depending on where it's running.

Something like:

try:
    # Running locally via databricks-connect: the shared notebook is importable
    # as a regular Python module (hyphens are not valid in module names, hence
    # another_notebook rather than another-notebook).
    import another_notebook
    print("running in VS Code")
except ImportError:
    # On Databricks the import fails, so fall back to the %run magic
    code = """
%run "another_notebook"
print("running in Databricks")
"""
    exec(code)


If you want to be more certain of the environment, perhaps you can use some info from the context. E.g. the following code

for a in spark.sparkContext.__dict__:
  print(a, getattr(spark.sparkContext, a))

run on my cluster prints:

_accumulatorServer <pyspark.accumulators.AccumulatorServer object at 0x7f678d944cd0>
_batchSize 0
_callsite CallSite(function='__init__', file='/databricks/python_shell/scripts/PythonShellImpl.py', linenum=1569)
_conf <pyspark.conf.SparkConf object at 0x7f678d944c40>
_encryption_enabled False
_javaAccumulator PythonAccumulatorV2(id: 0, name: None, value: [])
_jsc org.apache.spark.api.java.JavaSparkContext@838f1fd
_pickled_broadcast_vars <pyspark.broadcast.BroadcastPickleRegistry object at 0x7f678e699c40>
_python_includes []
_repr_html_ <function apply_spark_ui_patch.<locals>.get_patched_repr_html_.<locals>.patched_repr_html_ at 0x7f678e6a54c0>
_temp_dir /local_disk0/spark-fd8657a8-79a1-4fb0-b6fc-c68763f0fcd5/pyspark-3718c30e-c265-4e68-9a23-b003f4532576
_unbatched_serializer PickleSerializer()
appName Databricks Shell
environment {'PYTHONHASHSEED': '0'}
master spark://10.2.2.8:7077
profiler_collector None
pythonExec /databricks/python/bin/python
pythonVer 3.8
serializer AutoBatchedSerializer(PickleSerializer())
sparkHome /databricks/spark

So e.g. your condition could be:

# appName is "Databricks Shell" when running on a Databricks cluster
if "Databricks" in spark.sparkContext.appName:
    code = """
%run "another_notebook"
print("running in Databricks")
"""
    exec(code)
else:
    import another_notebook
    print("running in VS Code")

Upvotes: 0

Maxime

Reputation: 61

I found a solution that completes the try ... except approach mentioned by @Kashyap.

The Python file of a notebook that contains a %run command should look like this (note that the notebook name has to be a valid Python module name, e.g. another_notebook rather than another-notebook, for the local import to work):

# Databricks notebook source
# MAGIC %run "another_notebook"

# COMMAND ----------

try:
    import another_notebook
except ModuleNotFoundError:
    print("running on Databricks")

import standard_python_lib

# Some very interesting code

The # MAGIC %run line avoids a SyntaxError when the file is executed as plain Python (it is just a comment there) and tells Databricks that it is a magic command in a Python notebook. That way, the script works whether it is executed in Python via databricks-connect or in Databricks.
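
For completeness, here is a minimal sketch of what the shared notebook's exported source (another_notebook.py; the constant and helper below are made up) could look like, so that the same definitions are available through %run on Databricks and through a plain import locally:

# Databricks notebook source
# Shared definitions pulled into other notebooks via %run (on Databricks)
# or via a plain import (locally through databricks-connect).

# COMMAND ----------

DB_NAME = "my_database"  # hypothetical shared constant


def load_table(spark, table_name):
    """Return a DataFrame for the given table (hypothetical helper)."""
    return spark.table(f"{DB_NAME}.{table_name}")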

Upvotes: 3
