How to create a single jar in Scala with sbt assembly on AWS EMR? Running into "deduplicate: different file contents found in the following:" errors

Question

I'm on an AWS EMR cluster I just stood up. I have a Scala file that compiles, and I would like to build it into an assembly jar. However, when I run sbt assembly I get deduplicate errors.

Per https://medium.com/@tedherman/compile-scala-on-emr-cb77610559f0 I originally had a symbolic link from my project's lib directory to /usr/lib/spark/jars:

ln -s /usr/lib/spark/jars lib

However, I've noticed that my code passes sbt compile with or without this link. I don't understand why sbt assembly reports duplicates or how to resolve them. I'll also note that I install sbt in the bootstrap actions, as per the article.

With the symbolic link in

Some of the conflicts appear to be exact duplicates of the same file, for example:

[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.parquet/parquet-jackson/jars/parquet-jackson-1.10.1.jar:shaded/parquet/org/codehaus/jackson/util/CharTypes.class
[error] /usr/lib/spark/jars/parquet-jackson-1.10.1-spark-amzn-1.jar:shaded/parquet/org/codehaus/jackson/util/CharTypes.class

Others seem to be competing versions:

[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-core_2.11/jars/spark-core_2.11-2.4.3.jar:org/spark_project/jetty/util/MultiPartOutputStream.class
[error] /usr/lib/spark/jars/spark-core_2.11-2.4.5-amzn-0.jar:org/spark_project/jetty/util/MultiPartOutputStream.class

I don't understand why there are competing versions, or whether they are there by default or I did something to introduce them.

Without the symbolic link

I thought that removing the link would leave me with fewer problems, but I still get duplicates (just fewer of them):

[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.hadoop/hadoop-yarn-api/jars/hadoop-yarn-api-2.6.5.jar:org/apache/hadoop/yarn/factory/providers/package-info.class
[error] /home/hadoop/.ivy2/cache/org.apache.hadoop/hadoop-yarn-common/jars/hadoop-yarn-common-2.6.5.jar:org/apache/hadoop/yarn/factory/providers/package-info.class

I don't understand why the above is a duplicate, considering one jar is hadoop-yarn-api-2.6.5.jar and the other is hadoop-yarn-common-2.6.5.jar. The jars have different names, so why is this a duplicate?

Others appear to be different versions of the same class:

[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/javax.inject/javax.inject/jars/javax.inject-1.jar:javax/inject/Named.class
[error] /home/hadoop/.ivy2/cache/org.glassfish.hk2.external/javax.inject/jars/javax.inject-2.4.0-b34.jar:javax/inject/Named.class

Some have the same file name but come from different jars:

[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.arrow/arrow-format/jars/arrow-format-0.10.0.jar:git.properties
[error] /home/hadoop/.ivy2/cache/org.apache.arrow/arrow-memory/jars/arrow-memory-0.10.0.jar:git.properties
[error] /home/hadoop/.ivy2/cache/org.apache.arrow/arrow-vector/jars/arrow-vector-0.10.0.jar:git.properties

The same goes for these:

[error] deduplicate: different file contents found in the following:
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-catalyst_2.11/jars/spark-catalyst_2.11-2.4.3.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-core_2.11/jars/spark-core_2.11-2.4.3.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/hadoop/.ivy2/cache/org.apache.spark/spark-graphx_2.11/jars/spark-graphx_2.11-2.4.3.jar:org/apache/spark/unused/UnusedStubClass.class

For reference, some other information

Imports in my Scala object

import org.apache.spark.sql.SparkSession
import java.time.LocalDateTime
import com.amazonaws.regions.Regions
import com.amazonaws.services.secretsmanager.AWSSecretsManagerClientBuilder
import com.amazonaws.services.secretsmanager.model.GetSecretValueRequest
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import com.datarobot.prediction.spark.Predictors.{getPredictorFromServer, getPredictor}

My build.sbt

libraryDependencies ++= Seq(
  "net.snowflake" % "snowflake-jdbc" % "3.12.5",
  "net.snowflake" % "spark-snowflake_2.11" % "2.7.1-spark_2.4",
  "com.datarobot" % "scoring-code-spark-api_2.4.3" % "0.0.19",
  "com.datarobot" % "datarobot-prediction" % "2.1.4",
  "com.amazonaws" % "aws-java-sdk-secretsmanager" % "1.11.789",
  "software.amazon.awssdk" % "regions" % "2.13.23"
)

Thoughts? Please advise.

Saša Zejnilović · Accepted Answer

You will need to configure a merge strategy for sbt-assembly (see the sbt-assembly docs).

An example:

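assemblyMergeStrategy in assembly := {
  // Drop everything under META-INF; xs @ _* matches any number of path segments
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // git.properties sits at the jar root, so match the plain path string
  case "git.properties"              => MergeStrategy.discard
  case "application.conf"            => MergeStrategy.concat
  case "reference.conf"              => MergeStrategy.concat
  // For every remaining conflict, keep the first copy found on the classpath
  case _                             => MergeStrategy.first
}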
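
For the case where lib is symlinked to /usr/lib/spark/jars, another option may be to keep the cluster's own jars out of the assembly altogether, so they never compete with the copies sbt resolves into the Ivy cache. The fragment below is only a sketch: it assumes sbt 1.x with the sbt-assembly plugin, and it assumes the jars under /usr/lib/spark/jars are already on the runtime classpath when the job is launched with spark-submit on EMR. Neither assumption is confirmed by the question.

```scala
// build.sbt fragment (sketch; sbt 1.x with the sbt-assembly plugin assumed)

// Assumption: everything under /usr/lib/spark/jars is supplied by the EMR
// cluster at runtime, so it does not need to be bundled into the fat jar.
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  // Jars returned from this block are left out of the assembly.
  // getCanonicalPath resolves the lib symlink to its real location.
  cp.filter(_.data.getCanonicalPath.startsWith("/usr/lib/spark/jars/"))
}
```

This only removes the jars that come in through the symlink. Conflicts between jars in the Ivy cache (for example the git.properties and UnusedStubClass.class cases) would still need a merge strategy.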
