Michaelzh

Reputation: 490

In sbt, how can we specify the version of Hadoop on which Spark depends?

I have an sbt project that uses Spark and Spark SQL, but my cluster runs Hadoop 1.0.4 and Spark 1.2 with Spark SQL 1.2. Currently my build.sbt looks like this:

libraryDependencies ++= Seq(
    "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.5",
    "com.datastax.cassandra" % "cassandra-driver-mapping" % "2.1.5",
    "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.2.1",
    "org.apache.spark" % "spark-core_2.10" % "1.2.1",
    "org.apache.spark" % "spark-sql_2.10" % "1.2.1",
)

It turns out that the app ends up running with Hadoop 2.2.0, but I want to see hadoop-*-1.0.4 in my dependencies. What should I do?

Upvotes: 2

Views: 3104

Answers (1)

Svend

Reputation: 7180

You can exclude Spark's transitive dependency on Hadoop and add an explicit one with the version you need, along these lines:

libraryDependencies ++= Seq(
    "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.5",
    "com.datastax.cassandra" % "cassandra-driver-mapping" % "2.1.5",
    "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.1",
    // keep Spark SQL, but drop the Hadoop artifacts it pulls in transitively
    "org.apache.spark" % "spark-sql_2.10" % "1.2.1" excludeAll(
         ExclusionRule("org.apache.hadoop")
    ),
    // then pin the Hadoop version your cluster actually uses
    "org.apache.hadoop" % "hadoop-client" % "1.0.4"
)

You probably do not need the explicit dependency on spark-core, since spark-sql should bring it in transitively.

Also, watch out: spark-cassandra-connector probably depends on Spark as well, which could transitively bring Hadoop back in, so you may need an exclusion rule there too (see the sketch below).
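
If that turns out to be the case, the same trick applies to the connector. A minimal sketch, assuming the connector coordinates from the question (the exclusion on that artifact is my addition, not something I have verified is actually needed):

    // hypothetical: strip the Hadoop artifacts that may come in
    // transitively through the connector's own Spark dependency
    "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.1" excludeAll(
        ExclusionRule("org.apache.hadoop")
    )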

Last note: an excellent tool for investigating which dependency comes from where is https://github.com/jrudolph/sbt-dependency-graph
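
For reference, a minimal setup sketch for that plugin; the plugin version below is an assumption, check the project's README for the current one:

    // project/plugins.sbt
    addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")

Then run sbt dependencyTree to print the resolved dependency tree and see exactly which library pulls in which Hadoop artifact.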

Upvotes: 4
