Reputation: 2158
Following the instructions in this repo: Google Cloud Storage and BigQuery connectors, I used the initialization action below to create a new Dataproc cluster with specific versions of the Google Cloud Storage and BigQuery connectors installed:
gcloud beta dataproc clusters create christos-test \
--region europe-west1 \
--subnet <a subnet zone> \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--initialization-actions gs://<bucket-name>/init-scripts/v.0.0.1/connectors.sh \
--metadata gcs-connector-version=1.9.16 \
--metadata bigquery-connector-version=0.13.16 \
--zone europe-west1-b \
--master-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image=<an-image> \
--project=<a-project-id> \
--service-account=composer-dev@vf-eng-ca-nonlive.iam.gserviceaccount.com \
--no-address \
--max-age=5h \
--max-idle=1h \
--labels=<owner>=christos,<team>=group \
--tags=allow-internal-dataproc-dev,allow-ssh-from-management-zone,allow-ssh-from-management-zone2 \
--properties=core:fs.gs.implicit.dir.repair.enable=false
As you can see, I had to place the external dependencies in a bucket of my own under gs://init-dependencies-big-20824/init-scripts/v.0.0.1/connectors.sh. As per the script's instructions (I am referring to the connectors.sh script), I also had to add the required connector jars to this bucket.
The script works fine and the cluster is created successfully. However, using a PySpark notebook through Jupyter still results in a BigQuery "class not found" exception. The same happens when I run PySpark directly from the terminal. The only way I was able to avoid that exception was by copying another jar (this time spark-bigquery_2.11-0.8.1-beta-shaded.jar) onto my cluster's master node and starting PySpark with:
pyspark --jars spark-bigquery_2.11-0.8.1-beta-shaded.jar
Obviously, this defeats the purpose.
What am I doing wrong? I thought about changing the connectors.sh script to include another copy step that places spark-bigquery_2.11-0.8.1-beta-shaded.jar under /usr/lib/hadoop/lib, so I tried to just copy this jar there manually and start PySpark, but this still didn't work...
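The extra step I was considering adding to connectors.sh is roughly the following (the source bucket path is just a placeholder of mine, not one of the script's variables):
# hypothetical additional copy step inside connectors.sh
gsutil cp "gs://<bucket-name>/init-scripts/v.0.0.1/spark-bigquery_2.11-0.8.1-beta-shaded.jar" /usr/lib/hadoop/lib/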
Upvotes: 3
Views: 1717
Reputation: 4457
The connectors init action applies only to the Cloud Storage and BigQuery connectors for Hadoop from GoogleCloudDataproc/hadoop-connectors.
Generally, you should not use the BigQuery connector for Hadoop if you are using Spark, because there is a newer BigQuery connector for Spark in the spark-bigquery-connector repository, which you are already adding with the --jars parameter.
To install the Spark BigQuery connector during cluster creation, you will need to write your own initialization action that copies it into the /usr/lib/spark/jars/ directory on the cluster nodes. Note that you don't need to replicate all the code in the connectors init action; just copy the Spark BigQuery connector shaded jar from your Cloud Storage bucket to the /usr/lib/spark/jars/ directory:
gsutil cp gs://path/to/spark-bigquery-connector.jar /usr/lib/spark/jars/
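For example, a minimal standalone initialization action could be just a sketch like this (the bucket path and jar name are placeholders):
#!/usr/bin/env bash
# Minimal init action sketch: copy the Spark BigQuery connector shaded jar
# from your Cloud Storage bucket (placeholder path) onto Spark's classpath.
set -euxo pipefail

readonly CONNECTOR_GCS_PATH="gs://<your-bucket>/jars/spark-bigquery_2.11-0.8.1-beta-shaded.jar"  # placeholder
gsutil cp "${CONNECTOR_GCS_PATH}" /usr/lib/spark/jars/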
A better approach could be to embed the Spark BigQuery connector in your application distribution along with your other dependencies.
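A related alternative (not embedding it in the distribution, but resolving it at submit time) is to pass the connector's Maven coordinates via the --packages flag of spark-submit; the coordinates and version below are an assumption based on the connector version mentioned later in this answer:
# pull the Spark BigQuery connector from Maven at submit time (coordinates assumed)
spark-submit \
  --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.15.1-beta \
  your_job.py  # placeholder for your application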
Update: the connectors initialization action now supports the Spark BigQuery connector and can be used to install it on a Dataproc cluster during cluster creation:
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \
--metadata spark-bigquery-connector-version=0.15.1-beta
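With the connector installed cluster-wide, PySpark and Jupyter should pick it up without the --jars flag. Assuming the init action places the jar under /usr/lib/spark/jars/, you can sanity-check it on the master node:
# verify the Spark BigQuery connector jar is on Spark's classpath (path assumed)
ls /usr/lib/spark/jars/ | grep -i bigquery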
Upvotes: 4
Reputation: 31
Use the Google public spark-lib bucket that includes the dependencies:
--jars "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
or
--jars "gs://spark-lib/bigquery/spark-bigquery-latest.jar"
depending on the Scala version your Dataproc cluster is deployed with.
It works beautifully for me.
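For example, launching PySpark with the Scala 2.12 build of the connector would look like this (use the jar without the _2.12 suffix if your image is built on Scala 2.11):
pyspark --jars "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"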
Upvotes: 1