Marcin Cylke

Reputation: 2066

Passing configuration to Spark Job

I'd like to have an external config file that I pass to my Spark jobs. Assuming I'm running my job from an assembly and the config file is on my local filesystem:

spark-class my.assembly.jar my_application.conf

It would be great if I could access the config file inside the Spark job, but that isn't possible: the job's main method is executed on another node.

I've been trying to use the --files argument of spark-class, but this does not seem to work.

Trying similar behavior (to --files) in the Spark REPL ends with an error:

val inFile = sc.textFile(SparkFiles.get("conf.a"))
inFile.first()

The above assumes that the file conf.a has been passed to spark-class with the --files option.

Any thoughts on this problem? How can I fix it? I'd really like to use an external file as my configuration source.

I'm using apache-spark-0.9.0

Upvotes: 2

Views: 4372

Answers (2)

Mayur Rustagi

Reputation: 1

The easiest option is to load the file into your HDFS cluster. The tutorial you linked assumes that the file is present in HDFS and can therefore be accessed across the cluster. If you cannot do that, the addFile approach given by Freidereikhs will work for you, but then you have to bundle the conf file with the application.
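A minimal sketch of that approach, assuming the config has already been uploaded to HDFS and uses simple key=value lines; the namenode URL, path, and object name below are placeholders, not from the original answer:

    import org.apache.spark.SparkContext

    object HdfsConfExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "HdfsConfExample")

        // Read the config file straight from HDFS; every node in the cluster can reach it.
        // The hdfs:// URL is a placeholder for your own namenode and path.
        val confLines = sc.textFile("hdfs://namenode:8020/conf/my_application.conf").collect()

        // Turn "key=value" lines into a plain Map on the driver.
        val conf = confLines.filter(_.contains("=")).map { line =>
          val Array(k, v) = line.split("=", 2)
          k.trim -> v.trim
        }.toMap

        println(conf)
        sc.stop()
      }
    }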

Upvotes: 0

fedragon

Reputation: 884

You can use sc.addFile(path) to make your file visible to all the nodes:

import org.apache.spark.{SparkContext, SparkFiles}

object MySparkApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "MySparkApp", "/opt/spark",
      SparkContext.jarOfClass(this.getClass))

    // Distribute the file given on the command line to every node.
    sc.addFile(args(1))

    // On each node the file is then available under its plain name.
    val rdd = sc.textFile(SparkFiles.get("conf.a"))
  }
}

> sbt "run MySparkApp /tmp/conf.a"

Note that when using SparkFiles.get(path) I only give the file name, not the full path: the file comes from my local filesystem, so it gets copied into the job's working directory.
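As a follow-up sketch (not part of the original answer): because addFile ships the file to every node, tasks running on the executors can also resolve their own local copy with SparkFiles.get. The file path, object name, and the toy counting logic below are illustrative assumptions:

    import scala.io.Source
    import org.apache.spark.{SparkContext, SparkFiles}

    object ConfOnExecutors {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "ConfOnExecutors")

        // Ship the local config file to every node.
        sc.addFile("/tmp/conf.a")

        val data = sc.parallelize(1 to 10)

        // Each task resolves its local copy via SparkFiles.get and can use
        // the settings while processing its partition.
        val tagged = data.mapPartitions { iter =>
          val confPath = SparkFiles.get("conf.a") // local path on this node
          val settings = Source.fromFile(confPath).getLines().toList
          iter.map(x => (x, settings.size))
        }

        tagged.collect().foreach(println)
        sc.stop()
      }
    }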

Upvotes: 1
