Reputation: 649
I have been working in Databricks notebooks using Python and R. Once a job is done we terminate the cluster to save on cost (since we are billed while the machine is running).
That means we have to start the cluster again whenever we want to work in a notebook, and I have seen that it takes a lot of time because all the packages are installed again on the cluster. Is there any way to avoid this installation every time we start the cluster?
Upvotes: 4
Views: 3792
Reputation: 3751
Update: Databricks now allows custom docker containers.
Unfortunately not.
When you terminate a cluster, its memory state is lost, so when you start it again it comes back up from a clean image. Even if you add the desired packages to an init script, they will still have to be installed at every initialization.
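For illustration, a minimal sketch of such a cluster-scoped init script, assuming example package names and a script path of your choosing (it runs on every node each time the cluster starts, which is exactly why the installs are repeated):

#!/bin/bash
set -e
# Hypothetical init script, e.g. uploaded to dbfs:/databricks/init/install-packages.sh
# and attached to the cluster; it executes on every cluster start, so these installs recur.
/databricks/python/bin/pip install --quiet pandas scikit-learn
# Assumes an R-enabled runtime; package names are placeholders.
Rscript -e 'install.packages("data.table", repos = "https://cran.r-project.org")'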
You may ask Databricks support to check if it is possible to create a custom cluster image for you.
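Building on the update above about custom Docker containers, a rough sketch of what such an image could look like with Databricks Container Services, assuming example packages (verify the base image tag and pip path against the Databricks container documentation for your runtime):

# Dockerfile sketch: bake the packages into the image so cluster start can skip the installs
FROM databricksruntime/standard:latest
RUN /databricks/python3/bin/pip install pandas scikit-learn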
Upvotes: 2
Reputation: 171
I am using a conda environment to install the packages. After the first installation, I save the environment as a YAML file in DBFS and reuse the same YAML file in all other runs. This way I don't have to set up the packages by hand again.
Save the environment as a conda YAML specification.
%conda env export -f /dbfs/filename.yml
Apply the saved file in another notebook (or after a cluster restart) using conda env update.
%conda env update -f /dbfs/filename.yml
List the installed packages to verify:
%conda list
Upvotes: 0