Reputation: 469
We have an Airflow DAG that runs a PySpark job on Dataproc. The job needs a JDBC driver, which I'd normally pass to the Dataproc submit command:
gcloud dataproc jobs submit pyspark \
--cluster my-cluster \
--properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
--py-files ...
But how can I do it with Airflow's DataProcPySparkOperator?
For now we're adding this library to the cluster itself:
gcloud dataproc clusters create my-cluster \
--region global \
--zone europe-west1-d \
...
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
...
This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?
Upvotes: 0
Views: 2304
Reputation: 2158
I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.
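Something along these lines might work as a starting point. It is only a sketch: it assumes the Airflow 1.x contrib import path, and the task_id and the gs:// path to the main script are made-up placeholders. The property key is the same one you pass to gcloud at job level, so without the `spark:` prefix that the cluster-level flag uses.

```python
# Spark properties applied to this job only, mirroring
# `--properties spark.jars.packages=...` on `gcloud dataproc jobs submit pyspark`.
PYSPARK_PROPERTIES = {
    "spark.jars.packages": "mysql:mysql-connector-java:6.0.6",
}


def build_pyspark_task(dag):
    """Return a Dataproc PySpark task that pulls in the MySQL JDBC driver.

    Airflow is imported inside the function so this module can be loaded
    without Airflow installed; in a real DAG file a top-level import is fine.
    """
    # Assumption: Airflow 1.x contrib module path.
    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    return DataProcPySparkOperator(
        task_id="run_pyspark_job",           # hypothetical task name
        main="gs://my-bucket/jobs/main.py",  # hypothetical path to the driver script
        cluster_name="my-cluster",           # cluster name from the question
        dataproc_pyspark_properties=PYSPARK_PROPERTIES,
        dag=dag,
    )
```

With that in place you should be able to drop the `--properties` flag from the cluster creation command, since the package is then resolved per job rather than per cluster.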
Upvotes: 1