Reputation: 2597
I’m trying to write a DataFrame from Spark (PySpark) to an Amazon Redshift Serverless cluster using the Redshift JDBC driver.
I keep running into driver-related errors:
• java.sql.SQLException: No suitable driver
• java.lang.ClassNotFoundException: com.amazon.redshift.jdbc42.Driver
What I’ve Tried:
1. Setup:
• Spark version: 3.3.1
• Hadoop AWS packages: --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901
• Redshift JDBC driver: RedshiftJDBC42-2.1.0.30.jar downloaded from Amazon’s official site.
2. spark-submit command:
spark-submit \
  --conf spark.driver.bindAddress=127.0.0.1 \
  --conf spark.driver.host=127.0.0.1 \
  --driver-memory 4g \
  --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901 \
  --jars /path/to/RedshiftJDBC42-2.1.0.30.jar \
  --driver-class-path /path/to/RedshiftJDBC42-2.1.0.30.jar \
  my_script.py
I added --driver-class-path so that the JDBC driver JAR is visible to the Spark driver process; the JAR file definitely exists at the specified path. (An equivalent in-code configuration sketch is shown after step 3 below.)
3. In the Python Code:
import os

jdbc_url = "jdbc:redshift://:5439/dev"

(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.my_staging_table")
    .option("user", os.environ["REDSHIFT_USER"])
    .option("password", os.environ["REDSHIFT_PASSWORD"])
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())
The code runs fine until the .save() step, at which point I get No suitable driver or a ClassNotFoundException for the Redshift driver class.
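For reference, the equivalent in-code way of pointing Spark at the driver JAR (instead of the spark-submit flags in step 2) would look roughly like the sketch below; I haven't confirmed it behaves any differently, and the path is a placeholder.

from pyspark.sql import SparkSession

# Placeholder path to the same JAR passed via --jars / --driver-class-path above
redshift_jar = "/path/to/RedshiftJDBC42-2.1.0.30.jar"

spark = (SparkSession.builder
    .appName("redshift-write")
    # Ship the JAR to executors and add it to their classpath
    .config("spark.jars", redshift_jar)
    .config("spark.executor.extraClassPath", redshift_jar)
    # Note: the driver-side classpath normally has to be set before the driver
    # JVM starts, so with spark-submit this line may have no effect and
    # --driver-class-path remains the reliable option
    .config("spark.driver.extraClassPath", redshift_jar)
    .getOrCreate())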
What I Know:
• The Redshift JDBC driver class should be com.amazon.redshift.jdbc42.Driver.
• I’ve seen suggestions to use --driver-class-path plus --jars to ensure the driver is on both driver and executor classpaths.
• If I remove --driver-class-path, I sometimes get ClassNotFoundException. With it, I still get No suitable driver.
• My AWS credentials and S3 access work fine (I can read JSON from S3). The problem occurs only at the JDBC write to Redshift.
Questions:
• Is there another configuration step needed to ensure Spark recognizes the Redshift driver?
• Do I need to specify any additional spark configs for the JDBC driver?
• Are there known compatibility issues with this Redshift driver version and Spark/Hadoop versions?
• Should I consider a different Redshift driver JAR or a different approach (like spark-redshift or redshift-jdbc42-no-awssdk JAR)?
Any guidance on resolving the No suitable driver and ClassNotFoundException errors when writing to Redshift via JDBC in Spark would be greatly appreciated.
Upvotes: 1
Views: 71
Reputation: 11
Hi, in your jdbc_url, jdbc:redshift://:5439/dev, the host part is missing.
The JDBC url must have the following form:
jdbc:redshift://<host>:<port>/<database>
For example:
jdbc:redshift://mycluster.myclusteruuid.eu-west-1.redshift.amazonaws.com:5439/mydatabase
If you need to find your JDBC URL, you can copy it from the AWS console: the cluster details page shows both the JDBC and ODBC URLs.
By the way, it is normal that the error only appears at the .save() step: Spark builds the DAG lazily and only evaluates it when an action is triggered.
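A minimal sketch of the corrected write, with a made-up endpoint in place of yours and the rest of your options unchanged:

import os

# Replace this made-up endpoint with the JDBC URL copied from the AWS console
jdbc_url = "jdbc:redshift://mycluster.myclusteruuid.eu-west-1.redshift.amazonaws.com:5439/dev"

(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.my_staging_table")
    .option("user", os.environ["REDSHIFT_USER"])
    .option("password", os.environ["REDSHIFT_PASSWORD"])
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())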
Upvotes: 0