Reputation: 649
I've a simple project structure:
WordCount
|
|------------ project
|----------------|---assembly.sbt
|
|------------ resources
|------------------|------ Message.txt
|
|------------ src
|--------------|---main
|--------------------|---scala
|--------------------------|---org
|-------------------------------|---apache
|----------------------------------------|---spark
|----------------------------------------------|---Counter.scala
|
|------------ build.sbt
Here's how Counter.scala looks:
package org.apache.spark

object Counter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    val path: String = getClass.getClassLoader.getResource("Message.txt").getPath
    println(s"path = $path")
    // val lines = sc.textFile(path)
    // val wordsCount = lines
    //   .flatMap(line => line.split("\\s", 2))
    //   .map(word => (word, 1))
    //   .reduceByKey(_ + _)
    //
    // wordsCount.foreach(println)
  }
}
Note that the commented lines are actually correct, but the path variable is not. After building the fat jar with sbt assembly and running it with spark-submit to see the value of path, I get:
path = file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt
You can see that path is assigned the jar location, mysteriously followed by !/ and then the file name Message.txt.
On the other hand, when I'm inside the WordCount folder, run the REPL with sbt console, and then write
scala> getClass.getClassLoader.getResource("Message.txt").getPath
I get the correct path (without the file:/ prefix):
res1: String = /home/me/WordCount/target/scala-2.11/classes/Message.txt
Questions:
1 - Why are there two different outputs from the same command (i.e. getClass.getClassLoader.getResource("...").getPath)?
2 - How can I use the correct path, which appears in the console, inside my source file Counter.scala?
build.sbt:
name := "Counter"
version := "0.1"
scalaVersion := "2.11.8"
resourceDirectory in Compile := baseDirectory.value / "resources"
// allows us to include spark packages
resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at "http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at "https://mvnrepository.com/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
And the spark-submit command is:
spark-submit --master local --deploy-mode client --class org.apache.spark.Counter /home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar
Upvotes: 1
Views: 2191
Reputation: 95624
1 - why is there two different outputs from the same command?
By command, I assume you mean getClass.getClassLoader.getResource("Message.txt").getPath. So I would rephrase the question as: why does the same call to the classloader's getResource(...) return two different results depending on whether it runs under sbt console or spark-submit?
The answer is that they use different classloaders, each with a different classpath. console uses your project directories as the classpath, while spark-submit uses the fat JAR, which includes the resources. When a resource is found in a JAR, the classloader returns a JAR URL, which looks like jar:file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt. The !/ separates the JAR file from the entry inside it, and getPath drops only the leading jar: scheme, which is why the printed value still starts with file:/.
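To make the difference concrete, here is a minimal sketch that only parses the two URLs from the question, without opening either of them. The object name ResourceUrlDemo is hypothetical; the paths are copied from the question.

```scala
import java.net.URI

object ResourceUrlDemo {
  // What a directory-based classpath (sbt console) hands back: a plain file URL.
  val fromDirectory =
    new URI("file:/home/me/WordCount/target/scala-2.11/classes/Message.txt").toURL

  // What a JAR-based classpath (spark-submit with the fat JAR) hands back: a JAR URL.
  val fromJar =
    new URI("jar:file:/home/me/WordCount/target/scala-2.11/Counter-assembly-0.1.jar!/Message.txt").toURL

  def main(args: Array[String]): Unit = {
    // getPath removes only the outer scheme, so the JAR case keeps "file:" and "!/".
    println(s"${fromDirectory.getProtocol} -> ${fromDirectory.getPath}")
    println(s"${fromJar.getProtocol} -> ${fromJar.getPath}")
  }
}
```

The second value is not a filesystem path at all, so passing it to anything that expects one will most likely fail.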
The whole point of using Apache Spark is to distribute some work across multiple computers, so I don't think you want to see your machine's local path in production.
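If the file really is small and bundled with the application, one common way to avoid depending on a path is to read the resource as a stream, which behaves the same whether the classpath is a directory or a JAR. The following is only a sketch of that idea, not something prescribed above: the object and helper names are hypothetical, and UTF-8 is an assumed encoding.

```scala
import scala.io.Source

object ResourceLines {
  // Hypothetical helper: reads a classpath resource line by line.
  // Returns None when the resource is not on the classpath.
  def readResourceLines(name: String): Option[List[String]] =
    Option(getClass.getClassLoader.getResourceAsStream(name)).map { in =>
      val source = Source.fromInputStream(in, "UTF-8") // encoding is an assumption
      try source.getLines().toList
      finally source.close() // also closes the underlying stream
    }

  def main(args: Array[String]): Unit =
    readResourceLines("Message.txt") match {
      // In the Spark job the lines could be handed to sc.parallelize(lines)
      // instead of calling sc.textFile(path).
      case Some(lines) => lines.foreach(println)
      case None        => println("Message.txt is not on the classpath")
    }
}
```

This reads the whole file on the driver, so it only suits small inputs; larger data would normally live in storage that every executor can reach.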
Upvotes: 2