Sam

Reputation: 517

What's the purpose and use case of --files in spark-submit?

I have two files, stored locally for now (later they may be on S3/HDFS, etc.). A user file is ~75 MB (~1 million records) and a location file is ~150 KB (~7,000 records).

I want to read the files and pass their paths in from the command line.

I'm confused about whether I should just pass the full path of each file as an argument to main, or use the --files flag.

If the latter, should only small files (and if so, up to what size?) be sent through --files, since Spark copies the file to every executor and there is a transfer involved?
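
Concretely, these are the two alternatives as I understand them (the jar name, class name, and paths below are just placeholders):

import org.apache.spark.SparkFiles

// Option 1: pass the full path as a program argument.
//   spark-submit --class MyApp app.jar /data/users.json
// Every executor must be able to resolve that path itself (shared FS, S3, HDFS, ...).
val userFilePath = args(0)

// Option 2: ship the file with --files and pass only its name.
//   spark-submit --files /data/users.json --class MyApp app.jar users.json
// Spark copies users.json into each executor's working directory, and
// SparkFiles.get turns the bare name into that local path.
val shippedPath = SparkFiles.get(args(0))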

I have code like this:

import org.apache.log4j.Level
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success}

override def run(spark: SparkSession, config: RecipeCookingTimeConfig, storage: Storage): Unit = {

    /**
      * Only I/O here.
      * Transformations and pre-processing go in separate functions.
      */
    MyLogger.log.setLevel(Level.WARN)

    // Fall back to a local default path; otherwise resolve the name that
    // was shipped via --files to its executor-local copy.
    val userFilePath =
      if (config.userFileName.isEmpty) "/tmp/data/somefile.json"
      else SparkFiles.get(config.userFileName)
    val userData = storage.read(ReadConfig("json", userFilePath)) match {
      case Success(value) => value
      case Failure(ex)    => spark.stop(); System.exit(1); spark.emptyDataFrame // unreachable; only satisfies the match's result type
    }

    val airportFilePath =
      if (config.airportFileName.isEmpty) "/tmp/data/somefile2.json"
      else SparkFiles.get(config.airportFileName)
    val airportData = storage.read(ReadConfig("json", airportFilePath)) match {
      case Success(value) => value
      case Failure(ex)    => spark.stop(); System.exit(1); spark.emptyDataFrame // unreachable; only satisfies the match's result type
    }
  }

Upvotes: 0

Views: 3035

Answers (2)

user6860682

Comma-separated list of files to be placed in the working directory of each executor. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
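
For example, a file shipped with --files can then be opened inside a task by name alone; a minimal sketch (the file name and data here are made up):

import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

// Submitted with: spark-submit --files /local/path/lookup.csv ... app.jar
val spark = SparkSession.builder().appName("FilesDemo").getOrCreate()

val flagged = spark.sparkContext
  .parallelize(Seq("a", "b", "c"))
  .mapPartitions { rows =>
    // lookup.csv sits in this executor's working directory;
    // SparkFiles.get resolves the bare name to the local path.
    val lookup = Source.fromFile(SparkFiles.get("lookup.csv")).getLines().toSet
    rows.map(r => (r, lookup.contains(r)))
  }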

Additional tweaks for --files:

  • spark.files.fetchTimeout
  • spark.files.useFetchCache
  • spark.files.overwrite
  • spark.files.maxPartitionBytes
  • spark.files.openCostInBytes

More details in the official docs.
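
These settings go through the usual Spark conf mechanism; for instance (the values here are arbitrary examples, not recommendations):

import org.apache.spark.sql.SparkSession

// Equivalent to: spark-submit --conf spark.files.overwrite=true ...
val spark = SparkSession.builder()
  .appName("FilesConfDemo")
  .config("spark.files.overwrite", "true")    // overwrite files already present in the working dir
  .config("spark.files.fetchTimeout", "120s") // how long to wait when fetching files added via --files
  .getOrCreate()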

Upvotes: 1

Ged

Reputation: 18053

--files takes a comma-separated list of files:

  • Comma-separated list of files that are deposited in the working directory of each and every executor, when using YARN cluster mode, if memory serves correctly.

  • The use case (although I have never used it myself) is shipping configuration info that you can read in, as opposed to the args(x) approach; see the sketch below.
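
For example, a small properties file distributed with --files could be read like this (file and key names are hypothetical):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles

// Submitted with: spark-submit --files job.properties ... app.jar
// instead of passing each setting as a positional args(x) value.
val props = new Properties()
val in = new FileInputStream(SparkFiles.get("job.properties"))
try props.load(in) finally in.close()
val threshold = props.getProperty("threshold", "0.5")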

Upvotes: 2
