Reputation: 1885
I know very little about Java, but I have a Java question. I’ve got a Docker container for an image called amazon/aws-glue-libs. It lets me run and test AWS Glue ETL code locally on my Mac without having to use the AWS Console, and it also lets me debug and single-step through the ETL code, which is fantastic. However, I hit a snag trying to use JDBC to connect to the RDS MySQL database in my sandbox. The JDBC code works when run in the AWS Glue Console, but in the container it dies with a long list of Java messages; the key one is the last line of this traceback:
Traceback (most recent call last):
  File "/opt/project/glue/etl/script.py", line 697, in
    .load()
  File "/home/glue_user/spark/python/pyspark/sql/readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o241.load.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
Here is a sample of the kind of code I'm trying to run:
person_df = spark.read \
.format("jdbc") \
.option("url", JDBC_URL) \
.option("dbtable", "person") \
.option("user", USERNAME) \
.option("password", PASSWORD) \
.option("driver", "com.mysql.cj.jdbc.Driver") \
.load()
I can get a bash shell inside the Docker container. Where should I look to find this class/driver/etc? Or what else should I be looking at to resolve this problem?
Upvotes: 0
Views: 256
Reputation: 300
I've been through the same struggle, and here's the solution I found:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.mysql:mysql-connector-j:8.4.0 pyspark-shell'
To find the correct coordinates for the MySQL connector, visit mvnrepository.com, pick the connector version you want, and copy the coordinate string shown on the Gradle (Short) or Kotlin tab.
You can list multiple dependencies there, separated by commas, as in this example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mariadb.jdbc:mariadb-java-client:3.4.0,com.databricks:spark-xml_2.12:0.16.0 pyspark-shell'
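One caveat worth adding: PYSPARK_SUBMIT_ARGS only takes effect if it is set before the SparkSession (or GlueContext) is created, because PySpark reads it when it launches the JVM. Here is a minimal sketch of how the pieces fit together; the JDBC URL and credentials are placeholders you would replace with your own:

import os

# Must be set before the SparkSession is created, because PySpark reads this
# variable when it launches the JVM gateway.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.mysql:mysql-connector-j:8.4.0 pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-jdbc-test").getOrCreate()

# Placeholder connection details.
JDBC_URL = "jdbc:mysql://<rds-endpoint>:3306/<database>"
USERNAME = "<user>"
PASSWORD = "<password>"

person_df = spark.read \
    .format("jdbc") \
    .option("url", JDBC_URL) \
    .option("dbtable", "person") \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .load()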
Upvotes: 0
Reputation: 2075
Execute your command like this:
spark-submit --jars s3://S3BUCKET/jars/mysql-connector-j-8.0.32.jar SPARK_SCRIPT.py
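If you'd rather not change the launch command, the same thing can be set from inside the script. This is just a sketch, assuming the container can reach Maven Central; the app name and connector version are placeholders:

from pyspark.sql import SparkSession

# In-script equivalent of passing --packages to spark-submit: Spark fetches the
# connector from Maven Central when the session starts. If you already have the
# jar inside the container, .config("spark.jars", "/path/to/mysql-connector-j-8.0.32.jar")
# works the same way as --jars.
spark = SparkSession.builder \
    .appName("local-jdbc-test") \
    .config("spark.jars.packages", "com.mysql:mysql-connector-j:8.0.32") \
    .getOrCreate()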
Upvotes: 0