Myccha

Reputation: 1018

How to add a jar dependency to a Dataproc cluster in GCP?

In particular, how do I add the spark-bigquery-connector so that I can query data from within dataproc's Jupyter web interface?

Key links:
- https://github.com/GoogleCloudPlatform/spark-bigquery-connector

Goal: To be able to run something like:

from pyspark.sql import functions as f

# read the BigQuery table via the spark-bigquery-connector
s = spark.read.bigquery("transactions")

s = (s
    .where(f.col("quantity") >= 0)   # compare the column value, not the string literal
    .groupBy(f.col("date"))
    .agg({"sales_amt": "sum"})
     )

df = s.toPandas()

Upvotes: 3

Views: 6877

Answers (1)

Alan Borsato

Reputation: 256

There are basically 3 ways to achieve what you want:

1 At Cluster creation: You will have to create an initialization script (param --initialization-actions) to install your dependencies. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions

2 At Cluster creation: You can specify a customized image to be used when creating your cluster. https://cloud.google.com/dataproc/docs/guides/dataproc-images

3 At job runtime: You can pass the additional jar files when you run the job using the --jars parameter: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/jobs/submit/pyspark#--jars

I recommend (3) if you have a simple .jar dependency to add at runtime, like sqoop.jar.
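A job submission that passes the connector jar along could look roughly like this; the script name, cluster name, region and jar path are placeholders, not values from the question:

gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://my-bucket/spark-bigquery-latest.jar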

I recommend (1) if you have lots of packages to install before running your jobs. It gives you much more control.
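As a rough sketch of (1): the initialization action is just a script in a GCS bucket that every node runs at cluster creation. Assuming the Dataproc image keeps Spark's jars under /usr/lib/spark/jars (bucket, cluster name and jar path below are placeholders), it can be as small as:

#!/bin/bash
# install_connector.sh -- copy the connector jar onto every node so that Spark
# (and the Jupyter kernels) can see it; the jar location is a placeholder
gsutil cp gs://my-bucket/spark-bigquery-latest.jar /usr/lib/spark/jars/

and then be referenced when creating the cluster:

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install_connector.sh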

Option (2) definitely gives you total control, but you will have to maintain the image yourself (apply patches, upgrades, etc.), so unless you really need it I don't recommend it.

Upvotes: 3
