user8617180

Reputation: 277

Execute a bash script on a Dataproc cluster from Composer

I want to add jars to a specific location on a Dataproc cluster, using a simple shell script, once the cluster has been created.

I would like to automate this step to run from Composer: once the Dataproc cluster has been created, the next step is to execute a bash script that adds the jars to the Dataproc cluster.

Can you suggest which Airflow operator to use to execute bash scripts on the Dataproc cluster?

Upvotes: 2

Views: 1883

Answers (1)

Dennis Huo

Reputation: 10707

For running a simple shell script on the master node, the easiest way would be to use a pig sh Dataproc job, such as the following:

gcloud dataproc jobs submit pig --cluster ${CLUSTER} --execute 'sh echo hello world'

or to use pig fs to copy the jarfile directly:

gcloud dataproc jobs submit pig --cluster ${CLUSTER} --execute 'fs -cp gs://foo/my_jarfile.jar file:///tmp/localjar.jar'

The equivalent Airflow setup for those gcloud commands is to use the DataProcPigOperator with the query string param.
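For example, a minimal sketch of such a task, assuming the contrib DataProcPigOperator that ships with Composer's Airflow 1.x (the DAG settings, cluster name, and region below are placeholders, and exact parameters can vary a bit by Airflow version):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPigOperator

with DAG('copy_jars_to_dataproc',
         start_date=datetime(2018, 1, 1),
         schedule_interval=None) as dag:

    # Runs the same Pig 'fs -cp' shown above as a Dataproc job, so the jar
    # lands on the master node's local filesystem.
    copy_jar_to_master = DataProcPigOperator(
        task_id='copy_jar_to_master',
        cluster_name='my-cluster',    # placeholder
        region='us-central1',         # placeholder
        query="fs -cp gs://foo/my_jarfile.jar file:///tmp/localjar.jar",
    )

If the cluster is created from the same DAG (e.g. with the contrib DataprocClusterCreateOperator), you'd just set this task downstream of the create-cluster task.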

If you need to place jarfiles on all the nodes, it's better to just use an initialization action to copy the jarfiles at cluster startup time:

#!/bin/bash
# copy-jars.sh
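# Initialization action: Dataproc runs this once on every node at cluster-creation time.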

gsutil cp gs://foo/my-jarfile.jar /tmp/localjar.jar

If you need to dynamically determine which jarfiles to copy onto all nodes sometime after the cluster has already been deployed, you could take the approach described here: use an initialization action that continuously watches some HDFS directory for jarfiles and copies them to a local directory. Then, whenever you need a jarfile to appear on all the nodes, just submit a pig fs job that places the jarfile from GCS into the watched HDFS directory (sketched below).

Generally you don't want something automatically polling GCS itself, because GCS list requests cost money, whereas there's no extra cost to poll your Dataproc cluster's HDFS.
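If you go that route, the pig fs push itself can also be an Airflow task. A sketch, again using the contrib DataProcPigOperator, where hdfs:///watched-jars is a hypothetical directory that your initialization action watches:

from airflow.contrib.operators.dataproc_operator import DataProcPigOperator

# Inside the same `with DAG(...)` block as the earlier sketch.
# Drops the jar from GCS into the HDFS directory the init action watches;
# the watcher then copies it onto every node's local directory.
push_jar_to_watched_dir = DataProcPigOperator(
    task_id='push_jar_to_watched_dir',
    cluster_name='my-cluster',    # placeholder
    region='us-central1',         # placeholder
    query="fs -cp gs://foo/my_jarfile.jar hdfs:///watched-jars/my_jarfile.jar",
)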

Upvotes: 3
