rcoder

Reputation: 12829

Spark on EMR w/multiple JDBC jars

My setup: small Spark project built w/SBT (+ sbt-assembly for making "fat" jars) that needs to talk to multiple DB backends using JDBC (PostgreSQL + SQL Server in this case, but I think my problem generalizes). I can build + run my project in local driver mode with no problems, using either the fully-shaded JAR or a slim one w/JDBC libs added to the classpath using spark-submit. I've confirmed the classfiles are in my jar and the various drivers are correctly being concatenated into META-INF/services/java.sql.Driver, and can load any of the classes in question via the Scala repl when the fat JAR is in my classpath.
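
For context, the service-file concatenation comes from an sbt-assembly merge strategy along these lines (a from-memory sketch, not my exact build.sbt; syntax assumes sbt-assembly 0.14.x):

assemblyMergeStrategy in assembly := {
  // keep every driver's entry in META-INF/services/java.sql.Driver
  // instead of letting the default "first wins" strategy drop one
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}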

Now the problem: no combination of build options, job submission options, etc. that I can puzzle out allows me access to >1 JDBC driver once I submit the job to EMR. I've tried the plain fat JAR as well as adding the drivers via various spark-submit options (--jars, --packages, etc.). In every case my job throws the good ol' "No suitable driver" error, but only for the second driver to be loaded. One extra wrinkle: I'm submitting the job to EMR via an EC2 host rather than my local development machine (b/c cloud security, that's why) but it's an identical JAR in either case.

One other fun data point: I've verified the driver classes are available at runtime in the EMR job by forcing a Class.forName(...) on each of 'em before actually attempting to connect. Not a single ClassNotFoundException to be seen. Likewise dropping into spark-shell on the EMR master node and running the same code path to grab a DB connection (or more than one!) appears to work fine.
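
The check itself is nothing fancy; paraphrased (real class names, but not the production code):

// paraphrased runtime sanity check, not the production code
Seq(
  "org.postgresql.Driver",
  "com.microsoft.sqlserver.jdbc.SQLServerDriver"
).foreach { driverClass =>
  // throws ClassNotFoundException if the class isn't on the classpath
  Class.forName(driverClass)
}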

I've been poking at this for a few days now and am honestly starting to worry that it's an underlying classloader issue or something equally obtuse.

A few standard disclaimers: this is not an open source tool so I can't hand out much in the way of source code or raw logs, but I'm happy to look at and report back on anything that can be suitably redacted.

Upvotes: 3

Views: 569

Answers (1)

Alper t. Turker

Reputation: 35219

Since your investigation doesn't show any obvious problems, it might just be a Spark quirk. When no driver is specified, Spark asks java.sql.DriverManager to pick one for the connection URL, and DriverManager only returns drivers visible to the calling classloader, which can fail in exactly this way on a cluster. Explicitly declaring the driver class might help:

val postgresDF = spark.read
  .format("jdbc")
  .option("driver" , "org.postgresql.Driver")
  ...
  .load()

val msSQLDF = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  ...
  .load()

Upvotes: 2
