Neeleshkumar S

Reputation: 776

How do we specify Maven dependencies in PySpark?

When starting spark-submit / pyspark, we have the option of specifying the jar files using the --jars option. How can we specify Maven dependencies in PySpark? Do we have to pass all the jars every time we run a PySpark application, or is there a cleaner way?

Upvotes: 6

Views: 11300

Answers (2)

Vzzarr

Reputation: 5660

Another way I find very practical for testing/development is to create the SparkSession within the script and pass the Maven package dependencies through the spark.jars.packages config option, like this:

from pyspark.sql import SparkSession


spark = SparkSession.builder.master("local[*]")\
        .config('spark.jars.packages', 'groupId:artifactId:version')\
        .getOrCreate()

This will automatically download the specified dependencies from the Maven repository, so double check your internet connection. To pull in more than one package, list the coordinates in comma-separated fashion.
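For example, a minimal sketch with two comma-separated coordinates (both packages here are only illustrative; swap in whatever you actually need):

from pyspark.sql import SparkSession

# two Maven coordinates, joined with a comma (illustrative packages)
packages = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2",
    "org.apache.hadoop:hadoop-aws:3.2.0",
])

spark = SparkSession.builder.master("local[*]")\
        .config('spark.jars.packages', packages)\
        .getOrCreate()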

In the same way, any other Spark parameter listed here can be passed to the SparkSession.
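For instance, a sketch that also tunes the shuffle partition count (the value 8 is just an example):

spark = SparkSession.builder.master("local[*]")\
        .config('spark.jars.packages', 'groupId:artifactId:version')\
        .config('spark.sql.shuffle.partitions', '8')\
        .getOrCreate()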

For the full list of Maven packages please refer to https://mvnrepository.com/

Upvotes: 6

Martin Kretz

Reputation: 1543

According to https://spark.apache.org/docs/latest/submitting-applications.html, there is an option to specify --packages as a comma-delimited list of Maven coordinates.

./bin/spark-submit --packages my:awesome:package
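The coordinates follow the groupId:artifactId:version format, so a concrete run might look like the sketch below (the Kafka connector coordinate and the script name are only placeholders for whatever you actually depend on):

./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 my_pyspark_app.py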

Upvotes: 3
