timbram

Reputation: 1865

Pyspark - what are the differences in behavior between `spark-submit --jars` and `sc._jsc.addJar('myjar.jar')`

So, I have a PySpark program that runs fine with the following command:

spark-submit --jars terajdbc4.jar,tdgssconfig.jar --master local sparkyness.py

And yes, it's running in local mode and just executing on the master node.

I want to be able to launch my PySpark script, though, with just:

python sparkyness.py

So, I have added the following lines of code throughout my PySpark script to facilitate that:

import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

sconf = SparkConf()
sconf.setMaster("local")
sc = SparkContext(conf=sconf)

sc._jsc.addJar('/absolute/path/to/tdgssconfig.jar')
sc._jsc.addJar('/absolute/path/to/terajdbc4.jar')

This does not seem to be working, though. Every time I try to run the script with python sparkyness.py, I get the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o48.jdbc.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

What is the difference between spark-submit --jars and sc._jsc.addJar('myjar.jar'), and what could be causing this issue? Do I need to do more than just sc._jsc.addJar()?

Upvotes: 2

Views: 2053

Answers (1)

Garren S

Reputation: 5782

Use spark.jars when building the SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars', '/absolute/path/to/jar')\
    .getOrCreate()
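
Applied to the Teradata case in the question, a minimal sketch might look like the following (the JDBC URL, table, and credentials are placeholders, not values from the question):

from pyspark.sql import SparkSession

# spark.jars takes a comma-separated list of jar paths and puts them
# on the driver and executor classpaths at session startup.
spark = SparkSession.builder.appName('my_awesome') \
    .master('local') \
    .config('spark.jars', '/absolute/path/to/terajdbc4.jar,/absolute/path/to/tdgssconfig.jar') \
    .getOrCreate()

# Hypothetical connection details -- replace with your own.
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:teradata://your_teradata_host') \
    .option('driver', 'com.teradata.jdbc.TeraDriver') \
    .option('dbtable', 'your_table') \
    .option('user', 'your_user') \
    .option('password', 'your_password') \
    .load()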

Related: Add Jar to standalone pyspark

Edit: I don't recommend hijacking _jsc, because I don't think it handles distributing the jars to the driver and executors or adding them to the classpath.

Example: I created a new SparkSession without the Hadoop AWS jar, then tried to access S3. Here's the error (the same error as when adding the jar with sc._jsc.addJar):

Py4JJavaError: An error occurred while calling o35.parquet. : java.io.IOException: No FileSystem for scheme: s3

Then I created a session with the jar and got a new, expected error:

Py4JJavaError: An error occurred while calling o390.parquet. : java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
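
For reference, a sketch of that experiment (the hadoop-aws jar path and S3 URL are placeholders; depending on your Hadoop version you may also need the matching aws-java-sdk jar):

from pyspark.sql import SparkSession

# Session built with the Hadoop AWS jar on the classpath via spark.jars.
spark = SparkSession.builder.appName('s3_test') \
    .config('spark.jars', '/absolute/path/to/hadoop-aws.jar') \
    .getOrCreate()

# Without the jar this fails with "No FileSystem for scheme: s3";
# with it, Spark finds the filesystem and instead complains about
# missing AWS credentials, as shown above.
df = spark.read.parquet('s3://some-bucket/some/path')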

Upvotes: 1
