Mahi

Reputation: 107

Return a dataframe from another notebook in databricks

I have a notebook that processes a file and creates a data frame in a structured format. Now I need to import that data frame into another notebook, but the problem is that the notebook should only run for certain scenarios, so I need to check a condition before running it.

Usually, to import all data structures, we use %run. But in my case it needs to be a combination of an if clause and then a notebook run:

if "dataset" in path": %run ntbk_path

It's giving the error "path not exist".

if "dataset" in path": dbutils.notebook.run(ntbk_path)

With this one I cannot get all the data structures.

Can someone help me to resolve this error?

Upvotes: 6

Views: 16314

Answers (1)

Alex Ott

Reputation: 87174

To implement this correctly you need to understand how things work:

  • %run is a separate directive that has to be put into a separate notebook cell; you can't mix it with Python code. Plus, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything defined in that notebook - variables, functions, etc. - is available in the caller notebook.
  • dbutils.notebook.run is a function that takes a notebook path plus parameters and executes it as a separate job on the current cluster. Because it runs as a separate job, it doesn't share the context with the current notebook, and everything defined in it won't be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small maximum length - see the sketch below). One of the problems with dbutils.notebook.run is that scheduling a job takes several seconds, even if the code is very simple.
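
A minimal sketch of that return-string mechanism (the notebook path and the JSON payload below are just placeholders for illustration):

# in the called notebook: end execution and hand a small string back to the caller
import json
dbutils.notebook.exit(json.dumps({"status": "ok", "rows": 1000}))

# in the caller notebook: run the child as a separate job and read the returned string
returned = dbutils.notebook.run("/path/to/called_notebook", 60)
print(json.loads(returned))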

How can you implement what you need?

  • if you use dbutils.notebook.run, then in the called notebook you can register a temp view, and the caller notebook can read data from it (the examples are adapted from this demo)

Called notebook (Code1 - it requires two parameters: name - the name for the view, and n - the number of entries to generate):

name = dbutils.widgets.get("name")  # view name passed by the caller
n = int(dbutils.widgets.get("n"))   # number of rows to generate
df = spark.range(0, n)
df.createOrReplaceTempView(name)    # expose the dataframe to the caller via a temp view

Caller notebook (let's call it main):

if "dataset" in "path": 
  view_name = "some_name"
  dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
  df = spark.sql(f"select * from {view_name}")
  ... work with data
  • it's even possible to do something like %run, but it requires a bit of "magic". The foundation of it is the fact that you can pass arguments to the called notebook using $arg_name="value", and you can even refer to values specified in widgets. But in any case, the check of the value will happen in the called notebook.

The called notebook could look like the following:

flag = dbutils.widgets.get("generate_data")  # value passed from the caller
dataframe = None
if flag == "true":
  dataframe = ...  # create the dataframe here

and the caller notebook could look like the following:

------ cell in python
if "dataset" in "path": 
  gen_data = "true"
else:
  gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)

------- cell for %run
%run ./notebook_name $generate_data=$gen_data

------ again in python
dbutils.widgets.remove("gen_data") # remove widget
if dataframe is not None: # dataframe was defined by the called notebook
  ...  # do something with dataframe

Upvotes: 10
