pgruetter

Reputation: 1214

Overwrite Databricks Dependency

In our project we're using com.typesafe:config in version 1.3.4. According to the latest release notes, this dependency is already provided by Databricks on the cluster, but in a much older version (1.2.1). How can I override the provided dependency with our own version?

We use Maven; in our dependencies I have:

<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.4</version>
</dependency>

The jar file we build should therefore contain the newer version.

I created a Job by uploading the jar file. The Job fails because it can't find a method that was added after version 1.2.1, so it looks like the library we provide is shadowed by the older version on the cluster.

Upvotes: 8

Views: 1746

Answers (3)

nathluu

Reputation: 51

Databricks supports initialization scripts (cluster-scoped or global) that let you install or remove any dependency. The details are at https://docs.databricks.com/clusters/init-scripts.html.

In your initialization script, you can remove the default jar file located on the Databricks driver/executor classpath at /databricks/jars/ and add the expected one there.
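
For example, a cluster-scoped init script along these lines could swap the jar. This is only a sketch: the exact file name of the bundled config jar and the DBFS path to your own jar are assumptions you would need to verify on your cluster.

#!/bin/bash
# remove the config jar that Databricks ships (exact file name is an assumption - check /databricks/jars/)
rm /databricks/jars/*config-1.2.1*.jar
# copy your own, newer jar onto the classpath (hypothetical DBFS upload path)
cp /dbfs/FileStore/jars/config-1.3.4.jar /databricks/jars/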

Upvotes: 0

pgruetter

Reputation: 1214

We solved it in the end by using Spark's ChildFirstURLClassLoader. The project is open source, so you can check out the method itself here and its usage here.

But for reference, here is the method in its entirety. You need to provide a Seq of jar names that you want to override with your own; in our case it's the Typesafe config jar.

// imports required by this method; logger, ConfigurationException and
// Environment.defaultPathSeparator come from the surrounding project
import java.net.{URL, URLClassLoader}
import scala.annotation.tailrec
import org.apache.spark.util.ChildFirstURLClassLoader

def getChildFirstClassLoader(jars: Seq[String]): ChildFirstURLClassLoader = {
  val initialLoader = getClass.getClassLoader.asInstanceOf[URLClassLoader]

  @tailrec
  def collectUrls(clazz: ClassLoader, acc: Map[String, URL]): Map[String, URL] = {

    val urlsAcc: Map[String, URL] = acc ++
      // add urls on this level to accumulator
      clazz.asInstanceOf[URLClassLoader].getURLs
        .map(url => (url.getFile.split(Environment.defaultPathSeparator).last, url))
        .filter { case (name, url) => jars.contains(name) }
        .toMap

    // check if any jars without URL are left
    val jarMissing = jars.exists(jar => urlsAcc.get(jar).isEmpty)
    // return accumulated if there is no parent left or no jars are missing anymore
    if (clazz.getParent == null || !jarMissing) urlsAcc else collectUrls(clazz.getParent, urlsAcc)
  }

  // search classpath hierarchy until all jars are found or we have reached the top
  val urlsMap = collectUrls(initialLoader, Map())

  // check if everything found
  val jarsNotFound = jars.filter(jar => urlsMap.get(jar).isEmpty)
  if (jarsNotFound.nonEmpty) {
    logger.info(s"""available jars are ${initialLoader.getURLs.mkString(", ")} (not including parent classpaths)""")
    throw ConfigurationException(s"""jars ${jarsNotFound.mkString(", ")} not found in parent class loaders classpath. Cannot initialize ChildFirstURLClassLoader.""")
  }
  // create child-first classloader
  new ChildFirstURLClassLoader(urlsMap.values.toArray, initialLoader)
}

As you can see, it also has some logic to abort if the jar files you specified do not exist in the classpath.
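
For illustration, here is a minimal sketch of how the resulting loader might be used to pick up the newer Typesafe config classes. The jar name passed in is an assumption, and loading ConfigFactory reflectively is just one option.

// hypothetical jar name - it must match the file name of the config jar on the classpath
val childFirstLoader = getChildFirstClassLoader(Seq("config-1.3.4.jar"))

// load ConfigFactory through the child-first loader so the 1.3.4 classes win,
// then call the static ConfigFactory.load() reflectively
val configFactoryClass = childFirstLoader.loadClass("com.typesafe.config.ConfigFactory")
val config = configFactoryClass.getMethod("load").invoke(null)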

Upvotes: 3

Oscar Bonilla

Reputation: 339

In the end we fixed this by shading the relevant classes, adding the following to our build.sbt:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadedSparkConfigForSpark.@1").inAll
)
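
Since the question uses Maven, the equivalent there should be a relocation rule in the maven-shade-plugin, roughly like the following sketch (the plugin version is an assumption):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <relocation>
                        <pattern>com.typesafe.config</pattern>
                        <shadedPattern>shadedSparkConfigForSpark</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>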

Upvotes: 3
