Reputation: 2204
I am trying to set up the Google Cloud Platform managed Airflow service (Cloud Composer) so that it can trigger tasks in a workflow on my on-premises Hadoop cluster instead of on Google Cloud. I am unable to find much information about this. The Cloud Composer documentation covers triggering jobs on a shared VPC in Google Cloud, but not on on-premises infrastructure. Any help will be appreciated.
Upvotes: 1
Views: 923
Reputation: 91609
Cloud Composer runs its workers as CeleryExecutor pods within a GKE cluster. To trigger tasks on your on-premises infrastructure, you will need to configure the Composer environment so that the GKE cluster is reachable to/from your own network, unless your infrastructure is accessible from the public internet.
To do this, consider investigating Google Cloud Hybrid Connectivity. You can use Cloud Interconnect or Cloud VPN to peer your on-premises network with a VPC, which you can then configure Composer to use.
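Once that connectivity is in place, the Composer workers can reach your on-premises hosts by their private IPs. As a quick sanity check, you could run something like the following minimal sketch (the IP address, port, and DAG id are placeholders; the port is just the usual HDFS NameNode RPC port):

```python
# A minimal sketch to verify that Composer workers can reach an
# on-premises host once Cloud VPN / Interconnect peering is in place.
import socket
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

ON_PREM_HOST = "10.20.0.10"  # hypothetical private IP of an on-premises host
ON_PREM_PORT = 8020          # typical HDFS NameNode RPC port


def check_connectivity():
    # Raises if the worker cannot open a TCP connection to the on-prem host.
    with socket.create_connection((ON_PREM_HOST, ON_PREM_PORT), timeout=10):
        print(f"Reached {ON_PREM_HOST}:{ON_PREM_PORT} from the Composer worker")


with DAG(
    dag_id="on_prem_connectivity_check",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="check_on_prem_reachability",
        python_callable=check_connectivity,
    )
```

If this task succeeds, operators that talk to the Hadoop cluster directly should work from the same workers.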
Upvotes: 1
Reputation: 589
Assuming you're running Spark, you could make use of the SparkSubmitOperator in Airflow. The job (a jar or py file) that will be submitted to Spark must connect to the IP address of your on-premises Hadoop cluster. Be aware that running this Spark job remotely will require you either to configure your firewall correctly (not recommended) or, indeed, to run in a shared VPC. The latter creates a private network containing both your cluster and your Airflow setup, allowing the two systems to communicate with each other securely.
Upvotes: 0