Reputation: 517
I have two files, stored locally for now; later they may live on S3, HDFS, etc. The user file is ~75 MB (~1 million records). The location file is ~150 KB (~7,000 records).
I want to read both files and pass their paths on the command line.
I'm confused about whether I should pass the full path of each file as an argument to main, or use the --files flag.
If --files is the way to go, should only small files (up to what size?) be sent through it, given that it copies the file to every executor and so involves a transfer?
I have code like this:
override def run(spark: SparkSession, config: RecipeCookingTimeConfig, storage: Storage): Unit = {
  /**
   * Only I/O here
   * Transformations and Pre-Processing go in separate functions
   */
  MyLogger.log.setLevel(Level.WARN)

  val userFilePath =
    if (config.userFileName.isEmpty) "/tmp/data/somefile.json"
    else SparkFiles.get(config.userFileName)
  val userData = storage.read(ReadConfig("json", userFilePath)) match {
    case Success(value) => value
    case Failure(ex) => spark.stop(); System.exit(1); spark.emptyDataFrame
  }

  val airportFilePath =
    if (config.airportFileName.isEmpty) "/tmp/data/somefile2.json"
    else SparkFiles.get(config.airportFileName)
  val airportData = storage.read(ReadConfig("json", airportFilePath)) match {
    case Success(value) => value
    case Failure(ex) => spark.stop(); System.exit(1); spark.emptyDataFrame
  }
}
Upvotes: 0
Views: 3035
Reputation:
Comma-separated list of files to be placed in the working directory of each executor. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
Additional tweaks for --files:
More details in the official docs.
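As a rough illustration of the SparkFiles.get lookup described above, here is a minimal sketch that ships a file programmatically with SparkContext.addFile (which also distributes a file to the nodes, like --files) and reads it back on the local JVM. The object name SparkFilesSketch, the helper readShippedFile, the demo file contents, and the local[*] master are all assumptions made for the example. One caveat, as far as I understand it: the path returned by SparkFiles.get is local to the JVM that calls it, so it suits small files read directly on the driver or inside a task. For something like the 75 MB user file, passing an S3/HDFS path as a normal argument to spark.read is probably the safer route.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

import scala.io.Source

object SparkFilesSketch {

  // Reads a file that was shipped with --files (or SparkContext.addFile)
  // on the current JVM, using the local path that SparkFiles.get resolves.
  def readShippedFile(fileName: String): String = {
    val source = Source.fromFile(SparkFiles.get(fileName), "UTF-8")
    try source.mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-files-sketch")
      .master("local[*]") // assumption: local run, for demonstration only
      .getOrCreate()

    // Hypothetical small lookup file created just for this demo.
    val tmp: Path = Files.createTempFile("airports", ".json")
    Files.write(tmp, """{"code":"AMS"}""".getBytes(StandardCharsets.UTF_8))

    // Ship the file, then look it up by its bare file name.
    spark.sparkContext.addFile(tmp.toString)
    println(readShippedFile(tmp.getFileName.toString))

    spark.stop()
  }
}
```

With spark-submit, the addFile call would be replaced by passing the file through --files, and the lookup by bare file name stays the same.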
Upvotes: 1
Reputation: 18053
--files comma-separated files list
A comma-separated list of files that are deposited in the working directory of every executor, at least in YARN cluster mode if memory serves correctly.
The use case (although I have never used it myself) is configuration info that you can read in from a file, as opposed to using the args[x] approach.
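A minimal sketch of that configuration-file idea, assuming the settings live in a Java properties file. The object name ConfigFromFile, the helper loadProperties, and the key input.format are hypothetical. The helper takes a plain path so it can be tried without a cluster; in a Spark job, the path would presumably come from SparkFiles.get with the name of the file passed via --files.

```scala
import java.io.FileInputStream
import java.util.Properties

object ConfigFromFile {

  // Loads key=value settings from a properties file at the given local path.
  def loadProperties(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }
}
```

A caller could then use ConfigFromFile.loadProperties(path).getProperty("input.format", "json") to read a setting with a default, instead of relying on the position of a command-line argument.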
Upvotes: 2