This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName(appName) \
    .config(conf=spark_conf) \
    .config('spark.jars.packages', '') \
    .config('spark.jars.packages', '') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar,spark-bigquery-with-dependencies_2.12-0.21.1.jar,spark-bigquery-latest_2.11.jar') \
    .config('spark.jars', 'postgresql-42.2.23.jar,bigquery-connector-hadoop2-latest.jar') \
    .getOrCreate()
Then, when I try to write a demo Spark DataFrame to BigQuery:
df.write.format('bigquery') \
    .mode(mode) \
    .option("credentialsFile", "creds.json") \
    .option('table', table) \
    .option("temporaryGcsBucket", bucket) \
    .save()
It throws an error:
File "c:\sparktest\vnenv\lib\site-packages\py4j\", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at
My problem was caused by mismatched jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven packages and the code that worked for me.
from pyspark.sql import SparkSession

# You will have to download the jars listed under 'spark.jars' manually.
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config('spark.jars.packages',',,') \
    .config('spark.jars','guava-11.0.1.jar,gcsio-1.9.0-javadoc.jar') \
    .getOrCreate()
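One way to spot this kind of version mismatch before starting Spark is to look at the Scala suffix in each jar name. The prebuilt Spark 3.1.2 distributions use Scala 2.12, so a `_2.11` jar such as the one in the question will not match. Below is a rough sketch in plain Python; the helper names `scala_version` and `mismatched_jars` are made up for illustration, and the check assumes the jars follow the usual `artifact_<scala version>` naming convention.

```python
import re

# Hypothetical helper pattern: matches a Scala binary version suffix such as
# "_2.12" when it is followed by "-" or "." in the jar file name.
_SCALA_SUFFIX = re.compile(r"_(2\.\d+)(?=[-.])")


def scala_version(jar):
    """Return the Scala suffix found in a jar name, or None if there is none."""
    name = jar.rsplit("/", 1)[-1]  # drop any gs:// or directory prefix
    match = _SCALA_SUFFIX.search(name)
    return match.group(1) if match else None


def mismatched_jars(jars, expected="2.12"):
    """List the jars whose Scala suffix differs from the expected version."""
    return [j for j in jars if scala_version(j) not in (None, expected)]


# Jar names copied from the question.
jars = [
    "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar",
    "spark-bigquery-with-dependencies_2.12-0.21.1.jar",
    "spark-bigquery-latest_2.11.jar",
    "postgresql-42.2.23.jar",
]

bad = mismatched_jars(jars)
print(bad)
```

Jars without a Scala suffix (like the PostgreSQL driver) are skipped, since they do not depend on the Scala version. Note also that, as far as I can tell, calling `.config('spark.jars', ...)` twice on the builder keeps only the last value, so all jars should go into a single comma-separated string.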