Mano Bhargav

Reputation: 49

Upload the PySpark dataframe to bigquery as a dataproc job

I'm trying to submit a PySpark job on a Dataproc cluster. My PySpark job uploads a dataframe to BigQuery. When I submit the job on the cluster, it fails with an error. However, when I provide this jar, "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar", in the jar files parameter of submit job, the job executes successfully.

What I want is a way to avoid providing this jar at run-time and to run the job by giving the location of the .py file alone. How can I do it? Is it somehow possible to specify this jar within the .py file itself?

I used the approach below to provide the jar in the .py file itself, but it doesn't seem to work:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .appName('df-to-bq-sample').enableHiveSupport().getOrCreate()

Can anyone suggest a way to do this? I do not want to use the CLI for this. Thank you!

Upvotes: 2

Views: 546

Answers (1)

David Rabinowitz

Reputation: 30448

First of all, the mentioned jar is a must when reading from and writing to BigQuery. If you don't want to add it on job submission, you can add the BigQuery connector jar at cluster creation time using the connectors init action, like this:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
    --metadata spark-bigquery-connector-version=0.24.2
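
Once the cluster is created this way, the job can be submitted with just the location of the .py file, and the session builder no longer needs the spark.jars config. A minimal sketch of what the job file can then look like:

    from pyspark.sql import SparkSession

    # The BigQuery connector jar is already on the cluster thanks to the
    # connectors init action, so no spark.jars configuration is needed here.
    spark = SparkSession.builder.master('yarn') \
        .appName('df-to-bq-sample') \
        .enableHiveSupport() \
        .getOrCreate()

    # df.write.format('bigquery')... then works as in the question,
    # with no extra jars passed at submission time.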

Upvotes: 0
