rcoder

Reputation: 12829

Spark on EMR w/multiple JDBC jars

My setup: small Spark project built w/SBT (+ sbt-assembly for making "fat" jars) that needs to talk to multiple DB backends using JDBC (PostgreSQL + SQL Server in this case, but I think my problem generalizes). I can build + run my project in local driver mode with no problems, using either the fully-shaded JAR or a slim one w/JDBC libs added to the classpath using spark-submit. I've confirmed the classfiles are in my jar and the various drivers are correctly being concatenated into META-INF/services/java.sql.Driver, and can load any of the classes in question via the Scala repl when the fat JAR is in my classpath.
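
For context, the service-file concatenation comes from an sbt-assembly merge strategy along these lines (a from-memory sketch, not my exact build.sbt; syntax assumes sbt-assembly 0.14.x):

assemblyMergeStrategy in assembly := {
  // keep every driver's entry in META-INF/services/java.sql.Driver
  // instead of letting the default "first wins" strategy drop one
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}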

Now the problem: no combination of build options, job submission options, etc. that I can puzzle out allows me access to >1 JDBC driver once I submit the job to EMR. I've tried the plain fat JAR as well as adding the drivers via various spark-submit options (--jars, --packages, etc.). In every case my job throws the good ol' "No suitable driver" error, but only for the second driver to be loaded. One extra wrinkle: I'm submitting the job to EMR via an EC2 host rather than my local development machine (b/c cloud security, that's why) but it's an identical JAR in either case.

One other fun data point: I've verified the driver classes are available at runtime in the EMR job by forcing a Class.forName(...) on each of 'em before actually attempting to connect. Not a single ClassNotFoundException to be seen. Likewise dropping into spark-shell on the EMR master node and running the same code path to grab a DB connection (or more than one!) appears to work fine.
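
The check itself is nothing fancy; paraphrased (real class names, but not the production code):

// paraphrased runtime sanity check, not the production code
Seq(
  "org.postgresql.Driver",
  "com.microsoft.sqlserver.jdbc.SQLServerDriver"
).foreach { driverClass =>
  // throws ClassNotFoundException if the class isn't on the classpath
  Class.forName(driverClass)
}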

I've been poking at this for a few days now and am honestly starting to worry that it's an underlying classloader issue or something equally obtuse.

A few standard disclaimers: this is not an open source tool so I can't hand out much in the way of source code or raw logs, but I'm happy to look at and report back on anything that can be suitably redacted.

Upvotes: 3

Views: 569

Answers (1)

Alper t. Turker

Reputation: 35219

Since your investigation doesn't show any obvious problems, it might just be a Spark quirk. When no driver is specified, Spark asks java.sql.DriverManager to pick one for the connection URL, and DriverManager only returns drivers visible to the calling classloader, which can fail in exactly this way on a cluster. Explicitly declaring the driver class might help:

val postgresDF = spark.read
  .format("jdbc")
  .option("driver" , "org.postgresql.Driver")
  ...
  .load()

val msSQLDF = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  ...
  .load()

Upvotes: 2
