Reputation: 2153
I have a Databricks notebook that takes as input the location of a table and then generates graphs.
I can run this notebook from a wrapper notebook for many different tables.
Is it possible, every time the notebook runs, to save it together with its results as an HTML file in the Databricks file system?
In essence, I want to programmatically export the notebook, the same way I would manually do File > Export > HTML.
Is that possible? If yes, how?
Note: I was thinking that, if there is nothing out of the box, the notebooks are probably saved somewhere internally on the driver. I could get the file from there and move it where I want with dbutils.
Upvotes: 2
Views: 3961
Reputation: 2153
For completeness, and after following what @Alex suggested, I drop the resulting code here. What you need to do first is create a job that executes the notebook you want. Then you use the API to run the job and fetch the result.
import datetime
import json
import time

import Jobs  # in-house wrapper around the Databricks Jobs REST API (runJob, runsList, runsExport)


def _get_html_output(note_output: dict) -> str:
    data = note_output['views']
    output_file_names = set()
    for element in data:
        if element.get("type", "").lower() != "notebook":
            continue
        # de-duplicate view names in case several views share one
        output_file = element.get("name")
        counter = 0
        while output_file in output_file_names:
            counter += 1
            output_file = "%s_%d" % (output_file, counter)
        output_file_names.add(output_file)
        return element.get("content", "")


def run_DQ_visualization_and_save_html(table: str, id_col: str, snapshot_col: str = 'ReferenceDate') -> None:
    time_executed = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S").replace('/', '_')
    # get the API URL and token from the notebook context
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']
    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs instance
    runs_job_id = jobs_instance.runJob(53***********1, 'notebook',
                                       {'id_col': id_col,
                                        'snapshot_col': snapshot_col,
                                        'table': table})
    print(f"{table}: Running job. You can check its status at the following link: "
          f"https://adb-***868882***.***.azuredatabricks.net/?o=***687797****#job/*****54767")
    # poll until the run is completed, then export the results
    run_is_not_completed = True
    while run_is_not_completed:
        current_run = [run for run in jobs_instance.runsList('completed')['runs']
                       if run['run_id'] == runs_job_id['run_id']
                       and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
            current_run = current_run[0]
            print(f"{table}: Run has been completed")
            print(f"{table}, Result state: " + current_run['state']['result_state'])
            print(f"{table}, You can check the resulting output at the following link: {current_run['run_page_url']}")
    note_output = jobs_instance.runsExport(runs_job_id['run_id'], 'CODE')  # export the run's content
    notebook_result = _get_html_output(note_output)
    date_str = datetime.datetime.now().strftime("%Y%m%d")
    save_output_path = f"abfss://****@****2.dfs.core.windows.netpath/DQ/html_files/{date_str}/{table}.html"
    dbutils.fs.put(save_output_path, notebook_result, overwrite=False)
    print(f'{table}, Result saved at: {save_output_path}')
It works fine. However, Databricks recently put a 10 MB limit on what you can export. If you have plots, you can easily exceed that, and the export fails with an error. I don't know why Databricks added this size limit; it used not to be there, and I could download much bigger HTML files.
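If you want to fail fast with a clearer message when a run's exported HTML is close to that limit, a small guard could sit between runsExport and dbutils.fs.put. This is only a sketch: the function name is mine, and the exact limit value is an assumption, so check the current Databricks documentation before relying on it.

```python
# Assumed export limit (~10 MB); verify the current value in the Databricks docs.
EXPORT_LIMIT_BYTES = 10 * 1024 * 1024


def html_within_export_limit(html: str, limit: int = EXPORT_LIMIT_BYTES) -> bool:
    """Return True if the HTML payload fits under the assumed export size limit."""
    return len(html.encode("utf-8")) <= limit
```

For example, `html_within_export_limit(notebook_result)` could be checked before writing, to raise a descriptive error instead of a generic API failure.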
Upvotes: 0
Reputation: 87279
In general, you can export a notebook using either the REST API, via the export endpoint of the Workspace API (you can specify that you want to export as HTML), or the workspace export command of the Databricks CLI, which uses the REST API under the hood but is easier to use.
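As a sketch of the REST option: the Workspace API's export endpoint takes `path` and `format` query parameters and returns the exported file base64-encoded in a `content` field. The helper names below are mine, and the host/token are assumed to come from your own configuration:

```python
import base64
import json
import urllib.parse
import urllib.request

EXPORT_ENDPOINT = "/api/2.0/workspace/export"


def build_export_url(host: str, notebook_path: str, fmt: str = "HTML") -> str:
    """Build the Workspace API export URL for a given notebook path and format."""
    query = urllib.parse.urlencode({"path": notebook_path, "format": fmt})
    return f"{host}{EXPORT_ENDPOINT}?{query}"


def decode_export_response(body: bytes) -> str:
    """The API returns the exported file base64-encoded in the 'content' field."""
    payload = json.loads(body)
    return base64.b64decode(payload["content"]).decode("utf-8")


def export_notebook_html(host: str, token: str, notebook_path: str) -> str:
    """Export a workspace notebook as an HTML string via the Workspace API."""
    req = urllib.request.Request(build_export_url(host, notebook_path),
                                 headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return decode_export_response(resp.read())
```

Note this exports a notebook as it is stored in the workspace, not the output of a particular job run.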
But in your case, the notebook (most probably, if you use dbutils.notebook.run) is executed as a separate job, so you need to use the Runs Export API instead.
To call the API you need a personal access token and the host name, but it's easy to retrieve them programmatically from inside the notebook. See this answer for the exact details.
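Putting those pieces together, a minimal stand-alone sketch might look like the following. The endpoint and field names follow the Jobs API's runs export call; the helper names are mine, and `dbutils` is the object Databricks provides inside a notebook:

```python
import json
import urllib.parse
import urllib.request


def get_api_credentials(dbutils) -> tuple:
    """Read the API host and token from the notebook context (inside Databricks)."""
    ctx = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    return ctx["extraContext"]["api_url"], ctx["extraContext"]["api_token"]


def export_run_views(host: str, token: str, run_id: int) -> list:
    """Call the runs export endpoint and return the list of exported views."""
    query = urllib.parse.urlencode({"run_id": run_id, "views_to_export": "CODE"})
    req = urllib.request.Request(f"{host}/api/2.0/jobs/runs/export?{query}",
                                 headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["views"]


def notebook_html_from_views(views: list) -> str:
    """Pick the HTML content of the first notebook view, or '' if there is none."""
    return next((v["content"] for v in views if v.get("type", "").upper() == "NOTEBOOK"), "")
```

The HTML string returned by `notebook_html_from_views` can then be written out with `dbutils.fs.put`.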
P.S. A notebook is not an object on the file system or anything like that; it exists only in memory and is not available on the driver node. Maybe that will change with the upcoming Repos feature.
Upvotes: 1