Nira

Reputation: 469

Pass packages to pyspark running on dataproc from airflow?

We have an Airflow DAG that runs a PySpark job on Dataproc. The job needs a JDBC driver, which I'd normally pass via the dataproc submit command:

gcloud dataproc jobs submit pyspark \
  --cluster my-cluster \
  --properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
  --py-files ...

But how can I do it with Airflow's DataProcPySparkOperator?

For now we're adding this library to the cluster itself:

gcloud dataproc clusters create my-cluster \
  --region global \
  --zone europe-west1-d \
  ...
  --properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
  ...

This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?

Upvotes: 0

Views: 2304

Answers (1)

tix

Reputation: 2158

I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.

See: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py
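A minimal sketch of how that might look, based on the contrib operator linked above. The DAG id, schedule, GCS path to the main script, and task id are placeholders; the relevant part is the dataproc_pyspark_properties dict, which plays the role of --properties on the gcloud submit command:

# Hedged example: DAG id, paths, and cluster name are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

with DAG('my_dataproc_dag',                       # hypothetical DAG id
         start_date=datetime(2018, 1, 1),
         schedule_interval=None) as dag:

    run_pyspark = DataProcPySparkOperator(
        task_id='run_pyspark_job',
        main='gs://my-bucket/jobs/my_job.py',     # hypothetical GCS path to the job
        cluster_name='my-cluster',
        # Job-level Spark properties, equivalent to
        # --properties spark.jars.packages=... on gcloud dataproc jobs submit
        dataproc_pyspark_properties={
            'spark.jars.packages': 'mysql:mysql-connector-java:6.0.6',
        },
    )

With job-level properties you don't need to bake the connector into the cluster at creation time.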

Upvotes: 1
