Reputation: 2597
I’m trying to write a DataFrame from Spark (PySpark) to an Amazon Redshift Serverless cluster using the Redshift JDBC driver.
I keep running into driver-related errors:
• java.sql.SQLException: No suitable driver
• java.lang.ClassNotFoundException: com.amazon.redshift.jdbc42.Driver
What I’ve Tried:
1. Setup:
• Spark version: 3.3.1
• Hadoop AWS packages: --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901
• Redshift JDBC driver: RedshiftJDBC42-2.1.0.30.jar downloaded from Amazon’s official site.
2. spark-submit command:
spark-submit \
  --conf spark.driver.bindAddress=127.0.0.1 \
  --conf spark.driver.host=127.0.0.1 \
  --driver-memory 4g \
  --packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901 \
  --jars /path/to/RedshiftJDBC42-2.1.0.30.jar \
  --driver-class-path /path/to/RedshiftJDBC42-2.1.0.30.jar \
  my_script.py
I added --driver-class-path so that the JDBC driver JAR is visible to the Spark driver process; the JAR file definitely exists at the specified path. (An equivalent in-code configuration sketch is shown after step 3 below.)
3. In the Python Code:
import os

jdbc_url = "jdbc:redshift://:5439/dev"

(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.my_staging_table")
    .option("user", os.environ["REDSHIFT_USER"])
    .option("password", os.environ["REDSHIFT_PASSWORD"])
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())
The code runs fine until the .save() step, at which point I get No suitable driver or a ClassNotFoundException for the Redshift driver class.
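For reference, the equivalent in-code way of pointing Spark at the driver JAR (instead of the spark-submit flags in step 2) would look roughly like the sketch below; I haven't confirmed it behaves any differently, and the path is a placeholder.

from pyspark.sql import SparkSession

# Placeholder path to the same JAR passed via --jars / --driver-class-path above
redshift_jar = "/path/to/RedshiftJDBC42-2.1.0.30.jar"

spark = (SparkSession.builder
    .appName("redshift-write")
    # Ship the JAR to executors and add it to their classpath
    .config("spark.jars", redshift_jar)
    .config("spark.executor.extraClassPath", redshift_jar)
    # Note: the driver-side classpath normally has to be set before the driver
    # JVM starts, so with spark-submit this line may have no effect and
    # --driver-class-path remains the reliable option
    .config("spark.driver.extraClassPath", redshift_jar)
    .getOrCreate())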
What I Know:
• The Redshift JDBC driver class should be com.amazon.redshift.jdbc42.Driver.
• I’ve seen suggestions to use --driver-class-path plus --jars to ensure the driver is on both driver and executor classpaths.
• If I remove --driver-class-path, I sometimes get ClassNotFoundException. With it, I still get No suitable driver.
• My AWS credentials and S3 access work fine (I can read JSON from S3). The problem occurs only at the JDBC write to Redshift.
Questions:
• Is there another configuration step needed to ensure Spark recognizes the Redshift driver?
• Do I need to specify any additional spark configs for the JDBC driver?
• Are there known compatibility issues with this Redshift driver version and Spark/Hadoop versions?
• Should I consider a different Redshift driver JAR or a different approach (like spark-redshift or redshift-jdbc42-no-awssdk JAR)?
Any guidance on resolving the No suitable driver and ClassNotFoundException errors when writing to Redshift via JDBC in Spark would be greatly appreciated.
Upvotes: 1
Views: 71
Reputation: 11
Hi, in your jdbc_url, jdbc:redshift://:5439/dev, the host part is missing.
The JDBC url must have the following form:
jdbc:redshift://<host>:<port>/<database>
For example:
jdbc:redshift://mycluster.myclusteruuid.eu-west-1.redshift.amazonaws.com:5439/mydatabase
If you need to find your JDBC URL, you can copy it from the AWS console: the cluster details page shows both the JDBC and ODBC URLs.
By the way, it is normal that the error only appears at the .save() step: Spark builds the DAG lazily and only evaluates it when an action is triggered.
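A minimal sketch of the corrected write, with a made-up endpoint in place of yours and the rest of your options unchanged:

import os

# Replace this made-up endpoint with the JDBC URL copied from the AWS console
jdbc_url = "jdbc:redshift://mycluster.myclusteruuid.eu-west-1.redshift.amazonaws.com:5439/dev"

(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.my_staging_table")
    .option("user", os.environ["REDSHIFT_USER"])
    .option("password", os.environ["REDSHIFT_PASSWORD"])
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())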
Upvotes: 0