sbt: finding correct path to files/folders under resources directory

Question

I've a simple project structure:

WordCount
|
|------------ project
|----------------|---assembly.sbt
|
|------------ resources
|------------------|------ Message.txt
|
|------------ src
|--------------|---main
|--------------------|---scala
|--------------------------|---org
|-------------------------------|---apache
|----------------------------------------|---spark
|----------------------------------------------|---Counter.scala
|
|------------ build.sbt

Here's how Counter.scala looks:

package org.apache.spark

object Counter {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf())
        val path: String = getClass.getClassLoader.getResource("Message.txt").getPath
        println(s"path = $path")
//      val lines = sc.textFile(path)
//      val wordsCount = lines
//          .flatMap(line => line.split("\\s", 2))
//          .map(word => (word, 1))
//          .reduceByKey(_ + _)
//
//      wordsCount.foreach(println)
    }
}

Note that the commented-out lines are correct; the problem is the value of path. After building the fat JAR with sbt assembly and running it with spark-submit to print the value of path, I get:

path = file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt

As you can see, path points to the JAR location, mysteriously followed by !/ and then the file name Message.txt.
On the other hand, when I'm inside the WordCount folder and start the REPL with sbt console and then enter

scala> getClass.getClassLoader.getResource("Message.txt").getPath

I get the correct path (without the file:/ prefix):

res1: String = /home/me/WordCount/target/scala-2.11/classes/Message.txt

Questions:
1 - Why are there two different outputs from the same command (i.e. getClass.getClassLoader.getResource("...").getPath)?
2 - How can I use the correct path, the one that appears in the console, inside my source file Counter.scala?

For anyone who wants to try it, here's my build.sbt:

name := "Counter"

version := "0.1"

scalaVersion := "2.11.8"

resourceDirectory in Compile := baseDirectory.value / "resources"

// allows us to include spark packages
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at "http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at "https://mvnrepository.com/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"

And the spark-submit command is:

spark-submit --master local --deploy-mode client --class org.apache.spark.Counter /home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar

Eugene Yokota · Accepted Answer

1 - Why are there two different outputs from the same command?

By "command", I assume you mean getClass.getClassLoader.getResource("Message.txt").getPath. So I would rephrase the question as: why does the same call to the classloader's getResource(...) return two different results depending on whether it runs under sbt console or spark-submit?

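The answer is that they use different classloaders, each with a different classpath. sbt console puts your project's directories (such as target/scala-2.11/classes) on the classpath, while spark-submit uses the fat JAR, which bundles the resources. When a resource is found inside a JAR, the classloader returns a JAR URL, which looks like jar:file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt. The !/ separates the JAR file from the entry inside it, and calling getPath on such a URL drops only the leading jar: scheme, which is why the printed value still starts with file:.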
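To make the difference concrete, here is a small sketch that needs neither sbt nor Spark. It is only an illustration: the object name ResourceUrlDemo, the temporary directory, and the temporary JAR are all made up for this example. It puts the same Message.txt on a directory classpath and on a JAR classpath, then compares the URL protocol each classloader reports.

```scala
import java.io.FileOutputStream
import java.net.{URL, URLClassLoader}
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import java.util.jar.{JarEntry, JarOutputStream}

// Hypothetical demo object, not part of the original project.
object ResourceUrlDemo {
  // Returns the URL protocols reported for the same resource when it is
  // loaded from a directory classpath and from a JAR classpath.
  def protocols(): (String, String) = {
    val content = "hello world".getBytes(StandardCharsets.UTF_8)

    // Classpath 1: a plain directory, similar to what sbt console uses.
    val dir = Files.createTempDirectory("classes")
    Files.write(dir.resolve("Message.txt"), content)

    // Classpath 2: a JAR holding the same entry, similar to a fat JAR.
    val jar = Files.createTempFile("demo", ".jar")
    val out = new JarOutputStream(new FileOutputStream(jar.toFile))
    try {
      out.putNextEntry(new JarEntry("Message.txt"))
      out.write(content)
      out.closeEntry()
    } finally out.close()

    // No parent loader, so only the given URL is searched for the resource.
    val noParent: ClassLoader = null
    val fromDir = new URLClassLoader(Array[URL](dir.toUri.toURL), noParent)
    val fromJar = new URLClassLoader(Array[URL](jar.toUri.toURL), noParent)

    (fromDir.getResource("Message.txt").getProtocol,
     fromJar.getResource("Message.txt").getProtocol)
  }

  def main(args: Array[String]): Unit = {
    val (dirProtocol, jarProtocol) = protocols()
    println(s"directory classpath -> $dirProtocol") // file
    println(s"jar classpath       -> $jarProtocol") // jar
  }
}
```

The directory-backed loader reports a file URL, whose path is an ordinary filesystem path. The JAR-backed loader reports a jar URL, whose path is not something the filesystem can open directly.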

The whole point of using Apache Spark is to distribute work across multiple machines, so I don't think you want to depend on a path that is local to your machine in production.
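One way to avoid depending on a filesystem path at all is to read the resource as a stream, which behaves the same whether the resource lives in a directory or inside a JAR. The following is a minimal sketch of that idea, not necessarily the approach intended here. The object ResourceLines and its read method are hypothetical names, and it assumes the file is small enough to be read on the driver.

```scala
import scala.io.Source

// Hypothetical helper, not part of the original project.
object ResourceLines {
  // Reads a classpath resource line by line through a stream, so no
  // filesystem path is ever needed.
  def read(name: String, loader: ClassLoader): List[String] = {
    val in = loader.getResourceAsStream(name)
    require(in != null, s"resource not found on classpath: $name")
    val source = Source.fromInputStream(in, "UTF-8")
    try source.getLines().toList
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val lines = read("Message.txt", getClass.getClassLoader)
    // Assumption: in the Spark job, these lines could be passed to
    // sc.parallelize(lines) instead of calling sc.textFile(path).
    lines.foreach(println)
  }
}
```

This only suits small files bundled with the application. Larger inputs are usually better kept in storage that every executor can reach, with the location passed to the job as an argument.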
