Reputation: 1018
In particular, how do I add the spark-bigquery-connector so that I can query data from within dataproc's Jupyter web interface?
Key links:
- https://github.com/GoogleCloudPlatform/spark-bigquery-connector
Goal: To be able to run something like:
from pyspark.sql import functions as f

s = spark.read.format("bigquery").option("table", "transactions").load()
s = (s
     .where(f.col("quantity") >= 0)
     .groupBy(f.col("date"))
     .agg({"sales_amt": "sum"}))
df = s.toPandas()
Upvotes: 3
Views: 6877
Reputation: 256
There are basically three ways to achieve what you want:
1. At cluster creation:
You will have to create an initialization script (passed via the --initialization-actions parameter) to install your dependencies.
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
2. At cluster creation:
You can specify a custom image to be used when creating your cluster.
https://cloud.google.com/dataproc/docs/guides/dataproc-images
3. At job runtime:
You can pass the additional jar files when you submit the job using the --jars parameter (see the example after this list).
https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/jobs/submit/pyspark#--jars
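For example, a submit command for option (3) might look like the following. The job script, cluster name, and region are placeholders; the connector jar is published to the gs://spark-lib bucket, but check the connector README for the current path matching your Scala version:

# Submit a PySpark job with the BigQuery connector jar on the driver/executor classpath
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar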
I recommend (3) if you have a simple .jar dependency to run, like sqoop.jar.
I recommend (1) if you have lots of packages to install before running your jobs; it gives you much more control. A sketch of this approach is shown right after this paragraph.
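As a rough sketch of option (1), assuming a hypothetical init script uploaded to gs://my-bucket/init/install-deps.sh and a cluster named my-cluster (both placeholders), cluster creation could look like this; the Jupyter component flags match the notebook setup mentioned in the question:

# install-deps.sh (uploaded to GCS beforehand) could simply install extra Python packages, e.g.:
#   #!/bin/bash
#   pip install pandas-gbq
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway \
    --initialization-actions=gs://my-bucket/init/install-deps.sh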
Option (2) definitely gives you total control, but you will have to maintain the image yourself (apply patches, upgrades, etc.), so unless you really need it I don't recommend it.
Upvotes: 3