Sam

Reputation: 517

What's the purpose and use case of --files in spark-submit?

I have two files, stored locally for now (later they may be on S3/HDFS, etc.). A user file is ~75 MB (~1 million records) and a location file is ~150 KB (~7,000 records).

I want to read the files and pass their paths in from the command line.

I'm confused about whether I should just pass the full path of each file as an argument to main, or use the --files flag.

If the latter, should only small files (and if so, up to what size?) be sent through --files, since Spark copies the file to every executor and there is a transfer involved?
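
Concretely, these are the two alternatives as I understand them (the jar name, class name, and paths below are just placeholders):

import org.apache.spark.SparkFiles

// Option 1: pass the full path as a program argument.
//   spark-submit --class MyApp app.jar /data/users.json
// Every executor must be able to resolve that path itself (shared FS, S3, HDFS, ...).
val userFilePath = args(0)

// Option 2: ship the file with --files and pass only its name.
//   spark-submit --files /data/users.json --class MyApp app.jar users.json
// Spark copies users.json into each executor's working directory, and
// SparkFiles.get turns the bare name into that local path.
val shippedPath = SparkFiles.get(args(0))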

I have code like this:

import org.apache.log4j.Level
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success}

override def run(spark: SparkSession, config: RecipeCookingTimeConfig, storage: Storage): Unit = {

    /**
      * Only I/O here.
      * Transformations and pre-processing go in separate functions.
      */
    MyLogger.log.setLevel(Level.WARN)

    // Fall back to a local default path; otherwise resolve the name that
    // was shipped via --files to its executor-local copy.
    val userFilePath =
      if (config.userFileName.isEmpty) "/tmp/data/somefile.json"
      else SparkFiles.get(config.userFileName)
    val userData = storage.read(ReadConfig("json", userFilePath)) match {
      case Success(value) => value
      case Failure(ex)    => spark.stop(); System.exit(1); spark.emptyDataFrame // unreachable; only satisfies the match's result type
    }

    val airportFilePath =
      if (config.airportFileName.isEmpty) "/tmp/data/somefile2.json"
      else SparkFiles.get(config.airportFileName)
    val airportData = storage.read(ReadConfig("json", airportFilePath)) match {
      case Success(value) => value
      case Failure(ex)    => spark.stop(); System.exit(1); spark.emptyDataFrame // unreachable; only satisfies the match's result type
    }
  }

Upvotes: 0

Views: 3035

Answers (2)

user6860682

Comma-separated list of files to be placed in the working directory of each executor. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
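
For example, a file shipped with --files can then be opened inside a task by name alone; a minimal sketch (the file name and data here are made up):

import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

// Submitted with: spark-submit --files /local/path/lookup.csv ... app.jar
val spark = SparkSession.builder().appName("FilesDemo").getOrCreate()

val flagged = spark.sparkContext
  .parallelize(Seq("a", "b", "c"))
  .mapPartitions { rows =>
    // lookup.csv sits in this executor's working directory;
    // SparkFiles.get resolves the bare name to the local path.
    val lookup = Source.fromFile(SparkFiles.get("lookup.csv")).getLines().toSet
    rows.map(r => (r, lookup.contains(r)))
  }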

Additional tweaks for --files:

  • spark.files.fetchTimeout
  • spark.files.useFetchCache
  • spark.files.overwrite
  • spark.files.maxPartitionBytes
  • spark.files.openCostInBytes

More details in the official docs.
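
These settings go through the usual Spark conf mechanism; for instance (the values here are arbitrary examples, not recommendations):

import org.apache.spark.sql.SparkSession

// Equivalent to: spark-submit --conf spark.files.overwrite=true ...
val spark = SparkSession.builder()
  .appName("FilesConfDemo")
  .config("spark.files.overwrite", "true")    // overwrite files already present in the working dir
  .config("spark.files.fetchTimeout", "120s") // how long to wait when fetching files added via --files
  .getOrCreate()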

Upvotes: 1

Ged

Reputation: 18053

--files takes a comma-separated list of files:

  • Comma-separated list of files that are deposited in the working directory of each and every executor, when using YARN cluster mode, if memory serves correctly.

  • The use case (although I have never used it myself) is shipping configuration info that you can read in, as opposed to the args(x) approach; see the sketch below.
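
For example, a small properties file distributed with --files could be read like this (file and key names are hypothetical):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles

// Submitted with: spark-submit --files job.properties ... app.jar
// instead of passing each setting as a positional args(x) value.
val props = new Properties()
val in = new FileInputStream(SparkFiles.get("job.properties"))
try props.load(in) finally in.close()
val threshold = props.getProperty("threshold", "0.5")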

Upvotes: 2
