soMuchToLearnAndShare

Reputation: 1035

'Pimp' the Airflow DatabricksHook, or some Python library, to create a cluster and get the cluster_id for downstream tasks

I have a question similar to the one below, but I wonder whether there is an existing library that works nicely with Airflow to create a Databricks cluster, return its cluster_id, and reuse that id in downstream tasks.

Triggering Databricks job from Airflow without starting new cluster

My research so far: the DatabricksHook class has quite a few nice methods and API calls, but it has no call to create a cluster and then reuse that cluster within the same DAG.

If I have to add the methods myself: in Scala and some other languages, one can 'pimp' a library to add new methods to a third-party class. Any suggestions for an elegant way to add extra methods in Python?
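For context, Python has no direct equivalent of Scala's implicit classes; the usual routes seem to be subclassing or monkey-patching. Here is a rough sketch of the subclassing route I am considering. It relies on the hook's internal _do_api_call helper, which is not a public API and may change between Airflow versions:

    from airflow.contrib.hooks.databricks_hook import DatabricksHook

    # Databricks REST endpoint for creating a cluster.
    CREATE_CLUSTER_ENDPOINT = ('POST', 'api/2.0/clusters/create')


    class ExtendedDatabricksHook(DatabricksHook):
        """Sketch: DatabricksHook subclass with an extra create_cluster method."""

        def create_cluster(self, cluster_spec):
            # _do_api_call is an internal helper of the upstream hook,
            # so this could break across Airflow versions.
            response = self._do_api_call(CREATE_CLUSTER_ENDPOINT, cluster_spec)
            return response['cluster_id']

The alternative would be monkey-patching the method onto DatabricksHook at import time, but subclassing seems cleaner and easier to test.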

Info:

Upvotes: 0

Views: 219

Answers (1)

Jarek Potiuk

Reputation: 20097

Just a note: Airflow 1.10 reached end-of-life on June 17, so you should switch to Airflow 2 as soon as possible, as there will be no further improvements, nor even critical security fixes, for 1.10.

In Airflow 2 you have the Databricks provider, and its DatabricksHook has more methods (start_cluster/terminate_cluster). See https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/_api/airflow/providers/databricks/hooks/databricks/index.html#airflow.providers.databricks.hooks.databricks.DatabricksHook.restart_cluster for an example.
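For illustration, a minimal sketch of calling those hook methods directly, assuming a configured 'databricks_default' connection; 'existing-cluster-id' is a placeholder for a real cluster id:

    from airflow.providers.databricks.hooks.databricks import DatabricksHook

    hook = DatabricksHook(databricks_conn_id='databricks_default')

    # Both methods take the Databricks Clusters API request body as a dict.
    hook.start_cluster({'cluster_id': 'existing-cluster-id'})
    hook.terminate_cluster({'cluster_id': 'existing-cluster-id'})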

It seems those methods would cover your use case. You can easily write your own operator on top of those hook methods, along the lines of the sketch below, and possibly contribute the operator back.
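A sketch of such an operator; the class name and task_id are illustrative, not an existing Airflow operator. It starts a cluster and returns its id, which Airflow pushes to XCom so downstream tasks can reuse it:

    from airflow.models import BaseOperator
    from airflow.providers.databricks.hooks.databricks import DatabricksHook


    class DatabricksStartClusterOperator(BaseOperator):
        """Illustrative operator: starts an existing Databricks cluster and
        pushes its cluster_id to XCom for downstream tasks."""

        def __init__(self, cluster_id, databricks_conn_id='databricks_default', **kwargs):
            super().__init__(**kwargs)
            self.cluster_id = cluster_id
            self.databricks_conn_id = databricks_conn_id

        def execute(self, context):
            hook = DatabricksHook(databricks_conn_id=self.databricks_conn_id)
            hook.start_cluster({'cluster_id': self.cluster_id})
            # The return value is pushed to XCom automatically; downstream
            # tasks can read it with ti.xcom_pull(task_ids='start_cluster'),
            # assuming this task's task_id is 'start_cluster'.
            return self.cluster_id

Downstream tasks can then pull the id in a templated field via {{ ti.xcom_pull(task_ids='start_cluster') }}.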

Upvotes: 1
