Reputation: 2158
Following the instructions in this repo: Google Cloud Storage and BigQuery connectors, I used the initialization action below to create a new Dataproc cluster with specific versions of the Google Cloud Storage and BigQuery connectors installed:
gcloud beta dataproc clusters create christos-test \
--region europe-west1 \
--subnet <a subnet zone> \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--initialization-actions gs://<bucket-name>/init-scripts/v.0.0.1/connectors.sh \
--metadata gcs-connector-version=1.9.16 \
--metadata bigquery-connector-version=0.13.16 \
--zone europe-west1-b \
--master-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image=<an-image> \
--project=<a-project-id> \
--service-account=composer-dev@vf-eng-ca-nonlive.iam.gserviceaccount.com \
--no-address \
--max-age=5h \
--max-idle=1h \
--labels=<owner>=christos,<team>=group \
--tags=allow-internal-dataproc-dev,allow-ssh-from-management-zone,allow-ssh-from-management-zone2 \
--properties=core:fs.gs.implicit.dir.repair.enable=false
As you can see, I had to place the external dependencies in a bucket of my own under gs://init-dependencies-big-20824/init-scripts/v.0.0.1/connectors.sh. As per the script's instructions (I am referring to the connectors.sh script), I also had to add the required connector jars to this bucket.
The script works fine and the cluster is created successfully. However, using a PySpark notebook through Jupyter still results in a BigQuery "class not found" exception. The same happens when I run PySpark directly from the terminal. The only way I was able to avoid that exception was by copying another jar (this time spark-bigquery_2.11-0.8.1-beta-shaded.jar) onto my cluster's master node and starting PySpark with:
pyspark --jars spark-bigquery_2.11-0.8.1-beta-shaded.jar
Obviously, this defeats the purpose.
What am I doing wrong? I thought about changing the connectors.sh script to include another copy step that places spark-bigquery_2.11-0.8.1-beta-shaded.jar under /usr/lib/hadoop/lib, so I tried to just copy this jar there manually and start PySpark, but this still didn't work...
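The extra step I was considering adding to connectors.sh is roughly the following (the source bucket path is just a placeholder of mine, not one of the script's variables):
# hypothetical additional copy step inside connectors.sh
gsutil cp "gs://<bucket-name>/init-scripts/v.0.0.1/spark-bigquery_2.11-0.8.1-beta-shaded.jar" /usr/lib/hadoop/lib/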
Upvotes: 3
Views: 1717
Reputation: 4457
The connectors init action applies only to the Cloud Storage and BigQuery connectors for Hadoop from GoogleCloudDataproc/hadoop-connectors.
Generally, you should not use the BigQuery connector for Hadoop if you are using Spark, because there is a newer BigQuery connector for Spark in the spark-bigquery-connector repository, which you are already adding with the --jars parameter.
To install the Spark BigQuery connector during cluster creation, you will need to write your own initialization action that copies it into the /usr/lib/spark/jars/ directory on the cluster nodes. Note that you don't need to replicate all the code in the connectors init action; just copy the Spark BigQuery connector shaded jar from your Cloud Storage bucket to the /usr/lib/spark/jars/ directory:
gsutil cp gs://path/to/spark-bigquery-connector.jar /usr/lib/spark/jars/
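For example, a minimal standalone initialization action could be just a sketch like this (the bucket path and jar name are placeholders):
#!/usr/bin/env bash
# Minimal init action sketch: copy the Spark BigQuery connector shaded jar
# from your Cloud Storage bucket (placeholder path) onto Spark's classpath.
set -euxo pipefail

readonly CONNECTOR_GCS_PATH="gs://<your-bucket>/jars/spark-bigquery_2.11-0.8.1-beta-shaded.jar"  # placeholder
gsutil cp "${CONNECTOR_GCS_PATH}" /usr/lib/spark/jars/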
A better approach could be to embed the Spark BigQuery connector in your application distribution along with your other dependencies.
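A related alternative (not embedding it in the distribution, but resolving it at submit time) is to pass the connector's Maven coordinates via the --packages flag of spark-submit; the coordinates and version below are an assumption based on the connector version mentioned later in this answer:
# pull the Spark BigQuery connector from Maven at submit time (coordinates assumed)
spark-submit \
  --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.15.1-beta \
  your_job.py  # placeholder for your application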
Update: the connectors initialization action now supports the Spark BigQuery connector and can be used to install it on a Dataproc cluster during cluster creation:
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
--metadata spark-bigquery-connector-version=0.15.1-beta
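With the connector installed cluster-wide, PySpark and Jupyter should pick it up without the --jars flag. Assuming the init action places the jar under /usr/lib/spark/jars/, you can sanity-check it on the master node:
# verify the Spark BigQuery connector jar is on Spark's classpath (path assumed)
ls /usr/lib/spark/jars/ | grep -i bigquery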
Upvotes: 4
Reputation: 31
Use the Google public spark-lib bucket that includes the dependencies:
--jars "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
or
--jars "gs://spark-lib/bigquery/spark-bigquery-latest.jar"
depending on the Scala version your Dataproc cluster is deployed with.
It works beautifully for me.
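For example, launching PySpark with the Scala 2.12 build of the connector would look like this (use the jar without the _2.12 suffix if your image is built on Scala 2.11):
pyspark --jars "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"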
Upvotes: 1