AWS Redshift driver in Zeppelin

Question

I want to explore my data in Redshift using notebook Zeppelin. A small EMR cluster with Spark is running behind. I am loading databricks' spark-redshift library

%dep
z.reset()
z.load("com.databricks:spark-redshift_2.10:0.6.0")

and then

import org.apache.spark.sql.DataFrame

val query = "..."

val url = "..."
val port=5439
val table = "..."
val database = "..."
val user = "..."
val password = "..."

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", s"jdbc:redshift://${url}:$port/$database?user=$user&password=$password")
  .option("query",query)
  .option("tempdir", "s3n://.../tmp/data")
  .load()

df.show

but I get the error

java.lang.ClassNotFoundException: Could not load an Amazon Redshift JDBC driver; see the README for instructions on downloading and configuring the official Amazon driver

I added option

option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver")

but not for the better. I think I need to specify redshift's JDBC driver somewhere like I would passing --driver-class-path to spark-shell, but how to do that with zeppelin?

bzz · Accepted Answer

You can add external jars with dependencies like the JDBC driver using either Zeppelin's dependency-loading mechanism or, in case of Spark, using %dep dynamic dependency loader

When your code requires external library, instead of doing download/copy/restart Zeppelin, you can easily do following jobs using %dep interpreter.

Load libraries recursively from Maven repository

Load libraries from local filesystem

Add additional maven repository

Automatically add libraries to SparkCluster (You can turn off)

The latter would look something like:

%dep
// loads with all transitive dependencies from Maven repo
z.load("groupId:artifactId:version")

// or add artifact from filesystem
z.load("/path/to.jar")

and by convention have to be in the first paragraph of the note.

AWS Redshift driver in Zeppelin

Answers (1)

Related Questions