Reputation:
Hello, I am trying to use a resource file, a CSV, which I load in my Spark application with:
val resSource: BufferedSource = Source.fromResource("res.csv")
This works when I run it from the IDE and the CSV is in the resources folder.
But when I use assembly to create a fat JAR, this code throws a NullPointerException.
What is the best way to read external files when the JAR is deployed using spark-submit?
I have already tried previously answered solutions such as getClass.getResources, but they don't work. Can someone help me with this? I want to know how to use the --files option and how to access the file in the application.
Upvotes: 0
Views: 633
Reputation: 492
$ spark-shell --files data.txt
scala> import java.io.File
import java.io.File
scala> import org.apache.spark.SparkFiles
import org.apache.spark.SparkFiles
scala> val rootDir = new File(SparkFiles.getRootDirectory())
rootDir: java.io.File = /private/var/folders/lr/shv51xn15zqdtzb67c7f4ydc0000gn/T/spark-ce4fd20e-ebd2-4c05-b6a7-39c338332dd3/userFiles-3c78071d-75bb-47e1-a3e0-0be2b83c1600
scala> rootDir.listFiles.foreach(x => println(x.getName))
data.txt
scala> val dataFile = SparkFiles.get("data.txt")
dataFile: String = /private/var/folders/lr/shv51xn15zqdtzb67c7f4ydc0000gn/T/spark-ce4fd20e-ebd2-4c05-b6a7-39c338332dd3/userFiles-3c78071d-75bb-47e1-a3e0-0be2b83c1600/data.txt
scala> spark.read.text(dataFile).show()
+---------------+
|          value|
+---------------+
|'$1,200.00',abc|
|'$1,201.00',und|
|'$1,202.00',jsn|
|'$1,203.00',yhs|
|'$1,204.00',rfs|
|'$1,205.00',jsn|
|'$1,202.00',han|
+---------------+
https://spark.apache.org/docs/2.4.5/api/scala/#org.apache.spark.SparkFiles$
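The same call should carry over from spark-shell to a packaged application. Below is a minimal sketch, assuming the file is shipped as res.csv (the name from the question); the object name ResCsvApp and the helper readLines are made up for illustration. It has not been run against a cluster, so treat the path handling as something to verify in your deploy mode.

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.io.Source

// Hypothetical application object; rename to fit your project.
object ResCsvApp {

  // Illustrative helper: reads all lines of a local file and closes the handle.
  def readLines(path: String): List[String] = {
    val src = Source.fromFile(path)
    try src.getLines().toList
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    // The session must exist before SparkFiles can resolve anything.
    val spark = SparkSession.builder().appName("ResCsvApp").getOrCreate()

    // Resolve the local copy of the file passed via --files.
    // "res.csv" is assumed to match the file name given on the command line.
    val localPath = SparkFiles.get("res.csv")

    val lines = readLines(localPath)
    println(s"Read ${lines.size} lines from $localPath")

    spark.stop()
  }
}
```

It would be submitted with something like `spark-submit --class ResCsvApp --files /path/to/res.csv app-assembly.jar`, where the JAR name and path are placeholders. The point of the design is that the CSV no longer needs to be on the classpath inside the fat JAR, so Source.fromResource is not involved at all.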
Upvotes: 1