Alex

Reputation: 1537

How to connect PySpark to BigQuery

I am trying to read a table from BigQuery using PySpark.

I have tried the following

table = 'my-project-id.project-dataset.test_table_spark'
df = spark.read.format('bigquery').option('table', table).load()

However, I am getting this error

: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html

How can I read the BigQuery table from PySpark? (At the moment I'm using Python 2.)

Upvotes: 7

Views: 13879

Answers (1)

Brad Miro

Reputation: 251

You need to include the jar for the spark-bigquery-connector with your spark-submit. The easiest way to do that is to use the --jars flag to include the publicly available, most up-to-date version of the connector:

spark-submit --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar my_job.py

Though the examples reference Cloud Dataproc, this should work when submitting to any Spark cluster.
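If you prefer not to pass the jar on the command line, the connector can also be pulled in from inside the job via the `spark.jars.packages` configuration, which resolves Maven coordinates at startup. A minimal sketch (the version string below is illustrative; pick the connector release that matches your Scala/Spark version, and note that newer connector releases require Python 3):

```python
from pyspark.sql import SparkSession

# Resolve the BigQuery connector from Maven at session startup.
# The artifact version here is an example; check the connector's
# release list for one matching your Scala/Spark build.
spark = (
    SparkSession.builder
    .appName('bigquery-read-example')
    .config('spark.jars.packages',
            'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1')
    .getOrCreate()
)

# Same read as in the question, now with the data source available.
table = 'my-project-id.project-dataset.test_table_spark'
df = spark.read.format('bigquery').option('table', table).load()
df.show()
```

Either approach (`--jars` with the GCS-hosted jar, or `spark.jars.packages` with Maven coordinates) makes the `bigquery` data source visible to Spark and resolves the `ClassNotFoundException`.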

Upvotes: 8
