Christos Hadjinikolis

Reputation: 2158

Spinning up a Dataproc cluster with Spark BigQuery Connector

Following the instructions in this repo, Google Cloud Storage and BigQuery connectors, I used the initialization action below to create a new Dataproc cluster with specific versions of the Google Cloud Storage and BigQuery connectors installed:

gcloud beta dataproc clusters create christos-test \
--region europe-west1 \
--subnet <a subnet zone> \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--initialization-actions gs://<bucket-name>/init-scripts/v.0.0.1/connectors.sh \
--metadata gcs-connector-version=1.9.16 \
--metadata bigquery-connector-version=0.13.16 \
--zone europe-west1-b \
--master-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image=<an-image> \
--project=<a-project-id> \
--service-account=composer-dev@vf-eng-ca-nonlive.iam.gserviceaccount.com \
--no-address \
--max-age=5h \
--max-idle=1h \
--labels=<owner>=christos,<team>=group \
--tags=allow-internal-dataproc-dev,allow-ssh-from-management-zone,allow-ssh-from-management-zone2 \
--properties=core:fs.gs.implicit.dir.repair.enable=false

As you can see, I had to add the external dependencies in a bucket of my own, under gs://init-dependencies-big-20824/init-scripts/v.0.0.1/connectors.sh. As per the script's instructions (I am referring to the connectors.sh script), I also had to add the required connector jars to this bucket.

The script works fine and the cluster is created successfully. However, using a PySpark notebook through Jupyter still results in a BigQuery "class not found" exception. The same happens when I run PySpark directly from the terminal. The only way I was able to avoid that exception was by copying another jar (this time spark-bigquery_2.11-0.8.1-beta-shaded.jar) onto my cluster's master node and starting PySpark with:

pyspark --jars spark-bigquery_2.11-0.8.1-beta-shaded.jar

Obviously, this defeats the purpose.

What am I doing wrong? I thought about changing the connectors.sh script to include another copy step that puts spark-bigquery_2.11-0.8.1-beta-shaded.jar under /usr/lib/hadoop/lib, so I tried copying the jar there manually and starting PySpark, but this still didn't work...

Upvotes: 3

Views: 1717

Answers (2)

Igor Dvorzhak

Reputation: 4457

The connectors init action applies only to the Cloud Storage and BigQuery connectors for Hadoop from GoogleCloudDataproc/hadoop-connectors.

Generally you should not use the BigQuery connector for Hadoop if you are using Spark, because there is a newer BigQuery connector for Spark in the spark-bigquery-connector repository, which you are already adding with the --jars parameter.

To install the Spark BigQuery connector during cluster creation, you will need to write your own initialization action that copies it into the /usr/lib/spark/jars/ directory on the cluster nodes. Note that you don't need to replicate all the code in the connectors init action; you just need to copy the Spark BigQuery connector shaded jar from your Cloud Storage bucket to the /usr/lib/spark/jars/ directory:

gsutil cp gs://path/to/spark-bigquery-connector.jar /usr/lib/spark/jars/
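For example, a minimal custom initialization action could look like the sketch below (the bucket path and jar name are placeholders, not the exact paths from the question):

#!/bin/bash
# Sketch of a custom init action: copy the shaded Spark BigQuery connector jar
# from a Cloud Storage bucket onto each node's Spark classpath.
set -euxo pipefail

# Placeholder location -- replace with your own bucket and connector version.
CONNECTOR_JAR="gs://<your-bucket>/jars/spark-bigquery_2.11-0.8.1-beta-shaded.jar"

gsutil cp "${CONNECTOR_JAR}" /usr/lib/spark/jars/

Upload this script to a bucket of your own and reference it with --initialization-actions when creating the cluster.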

A better approach could be to embed the Spark BigQuery connector in your application distribution along with its other dependencies.
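If you are running plain PySpark and have no application jar to shade the connector into, an alternative sketch is to pull it from Maven at submit time with the --packages flag (the Maven coordinates and version below are assumptions based on the versions mentioned in this thread):

pyspark --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.15.1-beta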

Update

The connectors initialization action now supports the Spark BigQuery connector and can be used to install it on a Dataproc cluster during cluster creation:

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
    --metadata spark-bigquery-connector-version=0.15.1-beta
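Once a cluster is created this way, a PySpark job should find the connector without any --jars flag; for example (the job file path below is just a placeholder):

gcloud dataproc jobs submit pyspark gs://<bucket>/my_job.py \
    --cluster ${CLUSTER_NAME} \
    --region ${REGION}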

Upvotes: 4

user2200901

Reputation: 31

Use Google's public spark-lib jars, which include the dependencies:

--jars "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

or

--jars "gs://spark-lib/bigquery/spark-bigquery-latest.jar

depending on the Scala version that the Dataproc cluster is deployed with. A usage example follows below.
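For example, on a cluster image built with Scala 2.12 you could start a PySpark session like this:

pyspark --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar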

It works beautifully for me.

Upvotes: 1
