Reputation: 1885
I know very little about Java, but I have a Java question. I’ve got a Docker container for an image called amazon/aws-glue-libs. It lets me run and test AWS Glue ETL code locally on my Mac without having to use the AWS Console, and it also lets me debug and single-step through the ETL code, which is fantastic. However, I hit a snag trying to use JDBC to connect to the RDS MySQL database in my sandbox. The JDBC code works when run in the AWS Glue Console, but in the container it dies with a long list of Java messages; the key one is the last line of this traceback:
Traceback (most recent call last):
  File "/opt/project/glue/etl/script.py", line 697, in
    .load()
  File "/home/glue_user/spark/python/pyspark/sql/readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o241.load.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
Here is a sample of the kind of code I'm trying to run:
person_df = spark.read \
.format("jdbc") \
.option("url", JDBC_URL) \
.option("dbtable", "person") \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.option("driver", "com.mysql.cj.jdbc.Driver") \
.load()
I can get a bash shell inside the Docker container. Where should I look to find this class/driver/etc? Or what else should I be looking at to resolve this problem?
Upvotes: 0
Views: 256
Reputation: 300
I've been through the same struggle, and here's the solution I found:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.mysql:mysql-connector-j:8.4.0 pyspark-shell'
To find the correct coordinates for the MySQL connector, visit mvnrepository.com, pick the connector version you want, and copy the coordinate string shown on the Gradle (Short) or Kotlin tab.
You can list multiple dependencies there, separated by commas, as in this example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mariadb.jdbc:mariadb-java-client:3.4.0,com.databricks:spark-xml_2.12:0.16.0 pyspark-shell'
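One caveat worth adding: PYSPARK_SUBMIT_ARGS only takes effect if it is set before the SparkSession (or GlueContext) is created, because PySpark reads it when it launches the JVM. Here is a minimal sketch of how the pieces fit together; the JDBC URL and credentials are placeholders you would replace with your own:

import os

# Must be set before the SparkSession is created, because PySpark reads this
# variable when it launches the JVM gateway.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.mysql:mysql-connector-j:8.4.0 pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-jdbc-test").getOrCreate()

# Placeholder connection details.
JDBC_URL = "jdbc:mysql://<rds-endpoint>:3306/<database>"
USERNAME = "<user>"
PASSWORD = "<password>"

person_df = spark.read \
    .format("jdbc") \
    .option("url", JDBC_URL) \
    .option("dbtable", "person") \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .load()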
Upvotes: 0
Reputation: 2075
Execute your command like this:
spark-submit --jars s3://S3BUCKET/jars/mysql-connector-j-8.0.32.jar SPARK_SCRIPT.py
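If you'd rather not change the launch command, the same thing can be set from inside the script. This is just a sketch, assuming the container can reach Maven Central; the app name and connector version are placeholders:

from pyspark.sql import SparkSession

# In-script equivalent of passing --packages to spark-submit: Spark fetches the
# connector from Maven Central when the session starts. If you already have the
# jar inside the container, .config("spark.jars", "/path/to/mysql-connector-j-8.0.32.jar")
# works the same way as --jars.
spark = SparkSession.builder \
    .appName("local-jdbc-test") \
    .config("spark.jars.packages", "com.mysql:mysql-connector-j:8.0.32") \
    .getOrCreate()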
Upvotes: 0