Reputation: 12829
My setup: small Spark project built w/SBT (+ sbt-assembly for making "fat" jars) that needs to talk to multiple DB backends using JDBC (PostgreSQL + SQL Server in this case, but I think my problem generalizes). I can build + run my project in local driver mode with no problems, using either the fully-shaded JAR or a slim one w/JDBC libs added to the classpath using spark-submit. I've confirmed the classfiles are in my jar and the various drivers are correctly being concatenated into META-INF/services/java.sql.Driver, and can load any of the classes in question via the Scala REPL when the fat JAR is in my classpath.
Now the problem: no combination of build options, job submission options, etc. that I can puzzle out allows me access to more than one JDBC driver once I submit the job to EMR. I've tried the plain fat JAR as well as adding the drivers via various spark-submit options (--jars, --packages, etc.). In every case my job throws the good ol' "No suitable driver" error, but only for the second driver to be loaded. One extra wrinkle: I'm submitting the job to EMR via an EC2 host rather than my local development machine (b/c cloud security, that's why) but it's an identical JAR in either case.
One other fun data point: I've verified the driver classes are available at runtime in the EMR job by forcing a Class.forName(...) on each of 'em before actually attempting to connect. Not a single ClassNotFoundException to be seen. Likewise, dropping into spark-shell on the EMR master node and running the same code path to grab a DB connection (or more than one!) appears to work fine.
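A successful Class.forName only shows that a class can be loaded; "No suitable driver" is raised when no driver registered with java.sql.DriverManager accepts the URL. A check along the following lines could narrow that gap. It is only a sketch: JdbcDriverCheck and its method names are hypothetical, and it uses nothing beyond the standard java.sql API.

```scala
import java.sql.{Driver, DriverManager}
import scala.util.Try

// Hypothetical helper (not from the original post): reports what
// DriverManager can actually see from the calling code's classloader.
object JdbcDriverCheck {

  /** Drivers currently registered with DriverManager and visible to the caller. */
  def registeredDrivers(): List[Driver] = {
    val drivers = DriverManager.getDrivers // java.util.Enumeration[Driver]
    val builder = List.newBuilder[Driver]
    while (drivers.hasMoreElements) builder += drivers.nextElement()
    builder.result()
  }

  /** Class names of the registered drivers that claim to handle the given JDBC URL. */
  def driversAccepting(url: String): List[String] =
    registeredDrivers()
      .filter(d => Try(d.acceptsURL(url)).getOrElse(false)) // acceptsURL may throw SQLException
      .map(_.getClass.getName)
}
```

Logging the result of driversAccepting for each JDBC URL just before connecting, inside the submitted job rather than in spark-shell, would show whether the second driver is missing from the registry or merely not matching the URL.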
I've been poking at this for a few days now and am honestly starting to worry that it's an underlying classloader issue or something equally obtuse.
A few standard disclaimers: this is not an open source tool so I can't hand out much in the way of source code or raw logs, but I'm happy to look at and report back on anything that can be suitably redacted.
Upvotes: 3
Views: 569
Reputation: 35219
Since your investigation doesn't show any obvious problems, it might just be a Spark problem. In that case, explicitly declaring the driver class for each source might help:
val postgresDF = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  ...
  .load()

val msSQLDF = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  ...
  .load()
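If the job builds readers for several backends from configuration, the driver class could be derived from the URL instead of being hard-coded at each call site. A minimal sketch, assuming the standard URL prefixes for these two drivers; DriverByUrl and driverFor are hypothetical names, not part of Spark or either driver.

```scala
// Hypothetical helper: maps a JDBC URL to the driver class name to pass
// as Spark's "driver" option. Only the two backends from the question are listed.
object DriverByUrl {

  private val driversByPrefix: Seq[(String, String)] = Seq(
    "jdbc:postgresql:" -> "org.postgresql.Driver",
    "jdbc:sqlserver:"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  )

  /** Returns the driver class for a known URL prefix, or None for anything else. */
  def driverFor(url: String): Option[String] =
    driversByPrefix.collectFirst {
      case (prefix, driverClass) if url.startsWith(prefix) => driverClass
    }
}

// Intended use (requires a SparkSession named `spark`, not created here):
//   DriverByUrl.driverFor(url).foreach { cls =>
//     spark.read.format("jdbc").option("url", url).option("driver", cls) /* ... */
//   }
```

Returning an Option keeps an unrecognized URL from silently falling back to whatever driver happens to be registered first.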
Upvotes: 2