Load CSV file as dataframe from resources within an Uber Jar

Question

I wrote a Scala application to run on Spark and built the uber JAR with sbt assembly.

The file I load is a lookup needed by the application, so the idea is to package it together with the code. It works fine from within IntelliJ using the path "src/main/resources/lookup01.csv".

I am developing on Windows and testing locally, and will later deploy to a remote test server.

But when I call spark-submit on the Windows machine, I get this error:

"org.apache.spark.sql.AnalysisException: Path does not exist: file:/H:/dev/Spark/spark-2.4.3-bin-hadoop2.7/bin/src/main/resources/"

It seems to resolve the path relative to the Spark home location instead of looking inside the JAR file.

How can I express the path so that the file is found inside the JAR package?

Here is how I load the DataFrame. After loading it, I transform it into other structures such as Maps.

val v_lookup = sparkSession.read.option( "header", true ).csv( "src/main/resources/lookup01.csv")

What I would like is a way to express the path that works in every environment where I run the JAR, ideally also from within IntelliJ while developing.

Edit: Scala version is 2.11.12

Update:

It seems that to get hold of the file inside the JAR, I have to read it as a stream. The code below worked, but I can't figure out a safe way to extract the headers of the file the way sparkSession.read.option does.

val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val inputDF = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF

After applying makeRDD I get an RDD that I can convert to a DataFrame, but it seems I lose the ability to use the "header" option from "read", which parses the header row into the schema.

Is there any way around this when using makeRDD?

Another problem with this approach is that I apparently have to parse the lines into columns manually.

Welsige · Accepted Answer

Everything points to this: once the file is inside the JAR, it can only be accessed as an input stream that reads the data from within the compressed archive, not as a regular file path.
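To check which situation applies in a given environment, a small helper like the one below may be useful. It is only a sketch: ResourceLocation and describe are hypothetical names introduced here, and it relies on nothing but the standard Class.getResource call. Run from an IDE, the resource typically resolves to a "file" URL; run from an uber JAR, it typically resolves to a "jar" URL, which path-based readers cannot open as a local file.

```scala
object ResourceLocation {
  // Returns a short description of where a classpath resource lives,
  // or None when the resource is not on the classpath at all.
  // ResourceLocation and describe are illustrative names, not part of the original code.
  def describe(resourcePath: String): Option[String] =
    Option(getClass.getResource(resourcePath)).map { url =>
      url.getProtocol match {
        case "jar"  => s"inside a JAR: $url"   // typical when running the uber JAR
        case "file" => s"plain file: $url"     // typical when running from the IDE
        case other  => s"$other resource: $url"
      }
    }
}
```

Calling ResourceLocation.describe("/lookup01.csv") in both environments should show the difference, assuming the file sits at the root of the resources directory as in the question.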

I arrived at a solution. It is not pretty, but it does what I need: read a CSV file, take the first two columns into a DataFrame, and then load them into a key-value structure (in this case I created a case class to hold the pairs).

I am considering migrating these lookups to a HOCON file, which may make loading them less convoluted.


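import sparkSession.implicits._

// Open the CSV as a classpath resource so it is found inside the JAR
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
// Read every line eagerly and wrap the lines in a single-column DataFrame
val input = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF()

// Split each line manually and keep the first two columns as a KeyValue pair
val myRdd = input.map { line =>
  val col = utils.Utils.splitCSVString(line.getString(0))
  KeyValue(col(0), col(1))
}

// Collect the pairs on the driver as a key -> value map
val myDF = myRdd.rdd.map(x => (x.key, x.value)).collectAsMap()

fileStream.close()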
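
A possible alternative to the manual splitting above, assuming Spark 2.2 or later (the question mentions 2.4.3): DataFrameReader.csv also accepts a Dataset[String], so the lines read from the stream can be handed to Spark's own CSV parser and the "header" option keeps working. The sketch below is not tested against the original project; LookupLoader, resourceLines, csvFromLines and lookupMap are hypothetical names, and it assumes the first two columns contain the key and the value.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

// LookupLoader and its method names are illustrative, not from the original code.
object LookupLoader {

  // Reads a classpath resource (works inside a JAR) fully into memory as lines.
  def resourceLines(resourcePath: String): List[String] = {
    val stream = getClass.getResourceAsStream(resourcePath)
    require(stream != null, s"Resource not found on classpath: $resourcePath")
    val source = Source.fromInputStream(stream, "UTF-8")
    try source.getLines().toList finally source.close()
  }

  // Parses in-memory CSV lines with Spark's CSV reader, so the first line
  // becomes the column names instead of a data row.
  def csvFromLines(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.option("header", "true").csv(spark.createDataset(lines))
  }

  // Collects the first two columns on the driver as a key -> value map.
  // Assumption: the lookup is small enough to fit in driver memory.
  def lookupMap(spark: SparkSession, lines: Seq[String]): Map[String, String] =
    csvFromLines(spark, lines)
      .collect()
      .map(row => row.getString(0) -> row.getString(1))
      .toMap
}
```

Usage would look like LookupLoader.lookupMap(sparkSession, LookupLoader.resourceLines("/lookup01.csv")). Because the resource is resolved through the classpath, the same call should work both from the IDE and from the assembled JAR.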
