Reputation: 1018
In particular, how do I add the spark-bigquery-connector so that I can query data from within dataproc's Jupyter web interface?
Key links:
- https://github.com/GoogleCloudPlatform/spark-bigquery-connector
Goal: To be able to run something like:
from pyspark.sql import functions as f

s = spark.read.format("bigquery").option("table", "transactions").load()
s = (s
     .where(f.col("quantity") >= 0)
     .groupBy(f.col("date"))
     .agg({"sales_amt": "sum"}))
df = s.toPandas()
Upvotes: 3
Views: 6877
Reputation: 256
There are basically three ways to achieve what you want:
1. At cluster creation:
You will have to create an initialization script (passed via the --initialization-actions parameter) to install your dependencies.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
2. At cluster creation:
You can specify a custom image to be used when creating your cluster.
https://cloud.google.com/dataproc/docs/guides/dataproc-images
3. At job runtime:
You can pass the additional jar files when you submit the job using the --jars parameter (see the example after this list).
https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/jobs/submit/pyspark#--jars
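For example, a submit command for option (3) might look like the following. The job script, cluster name, and region are placeholders; the connector jar is published to the gs://spark-lib bucket, but check the connector README for the current path matching your Scala version:

# Submit a PySpark job with the BigQuery connector jar on the driver/executor classpath
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar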
I recommend (3) if you have a simple .jar dependency to run, like sqoop.jar.
I recommend (1) if you have lots of packages to install before running your jobs; it gives you much more control. A sketch of this approach is shown right after this paragraph.
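As a rough sketch of option (1), assuming a hypothetical init script uploaded to gs://my-bucket/init/install-deps.sh and a cluster named my-cluster (both placeholders), cluster creation could look like this; the Jupyter component flags match the notebook setup mentioned in the question:

# install-deps.sh (uploaded to GCS beforehand) could simply install extra Python packages, e.g.:
#   #!/bin/bash
#   pip install pandas-gbq
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway \
    --initialization-actions=gs://my-bucket/init/install-deps.sh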
Option (2) definitely gives you total control, but you will have to maintain the image yourself (apply patches, upgrades, etc.), so unless you really need it I don't recommend it.
Upvotes: 3