Reputation: 5554
I use the following Scala code to create a text file in S3, with Apache Spark on AWS EMR.
import org.apache.spark.{SparkConf, SparkContext}

def createS3OutputFile(): Unit = {
  val conf = new SparkConf().setAppName("Spark Pi")
  val spark = new SparkContext(conf)
  // use s3n !
  val outputFileUri = s"s3n://$s3Bucket/emr-output/test-3.txt"
  val arr = Array("hello", "World", "!")
  val rdd = spark.parallelize(arr)
  rdd.saveAsTextFile(outputFileUri)
  spark.stop()
}

def main(args: Array[String]): Unit = {
  createS3OutputFile()
}
I create a fat JAR and upload it to S3. I then SSH into the cluster master and run the code with:
spark-submit \
  --deploy-mode cluster \
  --class "$class_name" \
  "s3://$s3_bucket/$app_s3_key"
In the S3 console I am seeing folders instead of files: each folder (for example test-3.txt) contains a long list of part files.
How do I output a simple text file to S3 as the output of my Spark job?
Upvotes: 3
Views: 18432
Reputation: 8957
Spark is a distributed computing framework, which means your code runs on multiple nodes and each partition is written out by a separate task. The saveAsTextFile() method accepts a directory path, not a file name. You can use coalesce() or repartition() to decrease the number of part files, but they will still be created under that directory path.
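For example, a minimal sketch (the bucket and path here are placeholders):
// Collapse to one partition so only a single part file is written.
// The output is still a directory: s3n://bucket/path/part-00000 (plus _SUCCESS).
rdd.coalesce(1).saveAsTextFile("s3n://bucket/path/")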
Alternatively, you can rename the output or merge the multiple part files into a single file, using the FileUtil class from the Hadoop FileSystem API.
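A minimal sketch of that merge, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3); the bucket and path names are placeholders:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(new URI("s3n://bucket"), hadoopConf)
// Concatenate all part files under the source directory into one destination file.
FileUtil.copyMerge(
  fs, new Path("s3n://bucket/path/"),
  fs, new Path("s3n://bucket/path-merged/test-3.txt"),
  false, hadoopConf, null)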
Store RDD to S3
rdd.saveAsTextFile("s3n://bucket/path/")
Upvotes: -1
Reputation: 3711
Try doing this:
rdd.coalesce(1, shuffle = true).saveAsTextFile(...)
My understanding is that the shuffle = true argument lets the upstream computation still run in parallel while the result is merged into a single partition, so the job outputs a single text file. But do be careful with massive data sets, since the final write happens in a single task.
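Equivalently, repartition(1) is shorthand for the same call; a short sketch with a placeholder output path:
// In the RDD API, repartition(n) simply calls coalesce(n, shuffle = true),
// so this also shuffles in parallel and writes a single part file.
rdd.repartition(1).saveAsTextFile("s3n://bucket/emr-output/")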
Upvotes: 5