Mano Bhargav

Reputation: 49

Upload the PySpark dataframe to bigquery as a dataproc job

I'm trying to submit a PySpark job on a Dataproc cluster. My PySpark job uploads a dataframe to BigQuery. When I submit the job on the cluster, it fails with an error. However, when I provide this jar, "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar", in the jar files parameter of submit job, the job executes successfully.

What I want is a way to avoid providing this jar at run-time and to run the job by giving the location of the .py file alone. How can I do it? Is it somehow possible to specify this jar within the .py file itself?

I used the approach below to provide the jar in the .py file itself, but it doesn't seem to work:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .appName('df-to-bq-sample').enableHiveSupport().getOrCreate()

Can anyone suggest a way to do this? I do not want to use the CLI for this. Thank you!

Upvotes: 2

Views: 546

Answers (1)

David Rabinowitz

Reputation: 30448

First of all, the mentioned jar is a must when reading from and writing to BigQuery. If you don't want to add it on job submission, you can add the BigQuery connector jar at cluster creation time using the connectors init action, like this:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
    --metadata spark-bigquery-connector-version=0.24.2
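
Once the cluster is created this way, the job can be submitted with just the location of the .py file, and the session builder no longer needs the spark.jars config. A minimal sketch of what the job file can then look like:

    from pyspark.sql import SparkSession

    # The BigQuery connector jar is already on the cluster thanks to the
    # connectors init action, so no spark.jars configuration is needed here.
    spark = SparkSession.builder.master('yarn') \
        .appName('df-to-bq-sample') \
        .enableHiveSupport() \
        .getOrCreate()

    # df.write.format('bigquery')... then works as in the question,
    # with no extra jars passed at submission time.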

Upvotes: 0
