Reputation: 1035
I have a similar question to the one below, but I wonder whether there is an existing library that works nicely with Airflow to create a Databricks cluster, return the cluster_id, and reuse it for downstream tasks:
Triggering Databricks job from Airflow without starting new cluster
My research shows that the DatabricksHook
class has quite a few nice methods and API calls, but it has no call to create a cluster
and reuse that cluster within the same DAG.
If I have to add the methods myself:
In Scala or other languages, one could "pimp" a library to add new methods to a third-party class. Is there an elegant way
to add extra methods like that in Python?
Alternatively, I could copy the DatabricksHook
class into my project and add the missing methods to it, but that might take longer than I can wait.
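A rough sketch of what I have in mind, by subclassing the hook (the create_cluster name and the endpoint tuple are my own additions, and _do_api_call is the hook's internal REST helper, so this relies on a private API):

```python
# Airflow 1.10-era import path for the contrib hook.
from airflow.contrib.hooks.databricks_hook import DatabricksHook

# Databricks Clusters API endpoint for creating a cluster.
CREATE_CLUSTER_ENDPOINT = ('POST', 'api/2.0/clusters/create')


class ClusterAwareDatabricksHook(DatabricksHook):
    """DatabricksHook subclass with an added cluster-creation method."""

    def create_cluster(self, cluster_spec):
        # _do_api_call returns the parsed JSON response, which includes
        # the cluster_id of the newly created cluster.
        response = self._do_api_call(CREATE_CLUSTER_ENDPOINT, cluster_spec)
        return response['cluster_id']
```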
Upvotes: 0
Views: 219
Reputation: 20097
Just a note: Airflow 1.10 reached end-of-life on June 17, so you should switch to Airflow 2 as soon as possible, as there will be no further improvements, nor even critical security fixes, for 1.10.
In Airflow 2 you have the Databricks provider, and its DatabricksHook has more methods (start_cluster / terminate_cluster). See https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/_api/airflow/providers/databricks/hooks/databricks/index.html#airflow.providers.databricks.hooks.databricks.DatabricksHook.restart_cluster for example.
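For example (a minimal sketch; the cluster_id value is a placeholder):

```python
from airflow.providers.databricks.hooks.databricks import DatabricksHook

hook = DatabricksHook(databricks_conn_id='databricks_default')

# Each method wraps the corresponding Clusters API call and takes the
# request body as a dict.
hook.start_cluster({'cluster_id': '1234-567890-abcde123'})
hook.restart_cluster({'cluster_id': '1234-567890-abcde123'})
hook.terminate_cluster({'cluster_id': '1234-567890-abcde123'})
```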
It seems it would be possible to use those methods. You can easily write your own Operator on top of those Hook methods, and possibly contribute the operator back.
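A rough sketch of such an operator (hypothetical: the hook has no public create-cluster method, so this goes through its internal _do_api_call helper, and the operator and parameter names are my own):

```python
from airflow.models import BaseOperator
from airflow.providers.databricks.hooks.databricks import DatabricksHook


class DatabricksCreateClusterOperator(BaseOperator):
    """Create a Databricks cluster and return its cluster_id via XCom."""

    def __init__(self, cluster_spec, databricks_conn_id='databricks_default',
                 **kwargs):
        super().__init__(**kwargs)
        self.cluster_spec = cluster_spec
        self.databricks_conn_id = databricks_conn_id

    def execute(self, context):
        hook = DatabricksHook(databricks_conn_id=self.databricks_conn_id)
        # POST api/2.0/clusters/create; the response body contains the
        # cluster_id of the newly created cluster.
        response = hook._do_api_call(('POST', 'api/2.0/clusters/create'),
                                     self.cluster_spec)
        # Returning the value pushes it to XCom, so downstream tasks can
        # pull it and reuse the same cluster.
        return response['cluster_id']
```

Downstream tasks could then pull the id with xcom_pull (e.g. `{{ ti.xcom_pull(task_ids='create_cluster') }}` in a templated field) and pass it as existing_cluster_id to DatabricksSubmitRunOperator.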
Upvotes: 1