Reputation: 31
I want to use some Maven repository JAR files in my Spark session, so I am creating the session with 'spark.jars.packages', which should download the JARs automatically. This is not working as expected, even though the session config appears to be set correctly:
('spark.jars.packages', 'net.snowflake:snowflake-jdbc:3.13.6,net.snowflake:spark-snowflake_2.12:2.9.0-spark_3.1')
But I still get the error: "Failed to find data source: net.snowflake.spark.snowflake. Please find packages at https://spark.apache.org/third-party-projects.html".
I am using Glue v4.
If I upload the JARs manually it works, but I need them to be downloaded automatically.
What can I try next?
Upvotes: 2
Views: 1103
Reputation: 31
Glue doesn't allow dynamic loading of packages using "spark.jars.packages".
To add dependencies you need to use the magics %additional_python_modules and %extra_jars. For Python you can reference pip modules directly, but for JARs, unfortunately, it doesn't accept Maven coordinates: you need to get the JARs, put them on S3, and then reference them using %extra_jars, as sketched below.
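For example, in a Glue interactive session or notebook the magic takes a comma-separated list of S3 paths (the bucket and key names here are placeholders, not from the original post):

%extra_jars s3://my-bucket/jars/snowflake-jdbc-3.13.6.jar,s3://my-bucket/jars/spark-snowflake_2.12-2.9.0-spark_3.1.jar

If you need to stage the JARs on S3 first, here is a minimal Python sketch using urllib and boto3; the bucket name is a placeholder, and the URLs follow the standard Maven Central layout (repo/group-path/artifact/version/artifact-version.jar):

import urllib.request
import boto3

# JARs to fetch from Maven Central (same coordinates as in the question)
jars = [
    "https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/3.13.6/snowflake-jdbc-3.13.6.jar",
    "https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.12/2.9.0-spark_3.1/spark-snowflake_2.12-2.9.0-spark_3.1.jar",
]

s3 = boto3.client("s3")
for url in jars:
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)                   # download locally
    s3.upload_file(filename, "my-bucket", f"jars/{filename}")   # upload to s3://my-bucket/jars/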
Upvotes: 0
Reputation: 2468
The following code works for me. Copy this code into your script and rerun. Most likely the issue is that the Maven coordinates are not provided correctly.
Example code:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import *
import pyspark.sql.functions as F

# Maven coordinates for the Snowflake JDBC driver and the Spark connector
packages_so = 'net.snowflake:snowflake-jdbc:3.13.6,net.snowflake:spark-snowflake_2.12:2.9.0-spark_3.1'
repository = "https://repo1.maven.org/maven2"

spark = SparkSession \
    .builder \
    .config("spark.jars.packages", packages_so) \
    .config("spark.jars.repositories", repository) \
    .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
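If the packages resolve correctly, you can verify the connector is on the classpath by reading through the net.snowflake.spark.snowflake data source mentioned in the error. A minimal sketch, where every connection value is a placeholder and the option names follow the Snowflake Spark connector documentation:

# Connection options for the Snowflake connector; all values below are placeholders
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "my_database",
    "sfSchema": "my_schema",
    "sfWarehouse": "my_warehouse",
}

df = spark.read \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "my_table") \
    .load()
df.show()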
Upvotes: 0