Daniel Kats

Reputation: 5554

Write to a file in S3 using Spark on EMR

I use the following Scala code to create a text file in S3, with Apache Spark on AWS EMR.

import org.apache.spark.{SparkConf, SparkContext}

def createS3OutputFile() {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    // use s3n !
    val outputFileUri = s"s3n://$s3Bucket/emr-output/test-3.txt"
    val arr = Array("hello", "World", "!")
    val rdd = spark.parallelize(arr)
    rdd.saveAsTextFile(outputFileUri)
    spark.stop()
}

def main(args: Array[String]): Unit = {
    createS3OutputFile()
}

I create a fat JAR and upload it to S3. I then SSH into the cluster master and run the code with:

spark-submit \
    --deploy-mode cluster \
    --class "$class_name" \
    "s3://$s3_bucket/$app_s3_key"

However, this is what I am seeing in the S3 console: instead of files, there are folders.

[screenshot: S3 console listing showing folders instead of files]

Each folder (for example test-3.txt) contains a long list of block files. Picture below:

[screenshot: block files inside the test-3.txt folder]

How do I output a simple text file to S3 as the output of my Spark job?

Upvotes: 3

Views: 18432

Answers (2)

Shankar

Reputation: 8957

Spark is a distributed computing framework, which means your code runs on multiple nodes.

The saveAsTextFile() method takes a directory path, not a file name; each partition of the RDD is written as a separate part file under that path.

You can use coalesce() or repartition() to reduce the number of part files, but they will still be created under that path.

Alternatively, you can rename the output or merge the multiple part files into a single file using the FileUtil class from the Hadoop file system API.
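
If you go the FileUtil route, here is a minimal sketch, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3.x) and hypothetical source/destination paths:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merge the part files that saveAsTextFile() wrote under srcDir
// into a single object at dstFile (both paths are hypothetical).
def mergeToSingleFile(srcDir: String, dstFile: String, hadoopConf: Configuration): Boolean = {
  val srcFs = FileSystem.get(new URI(srcDir), hadoopConf)
  val dstFs = FileSystem.get(new URI(dstFile), hadoopConf)
  // deleteSource = false keeps the original part files; addString = null adds no separator
  FileUtil.copyMerge(srcFs, new Path(srcDir), dstFs, new Path(dstFile), false, hadoopConf, null)
}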

Store RDD to S3

rdd.saveAsTextFile("s3n://bucket/path/")

Also, check this

Upvotes: -1

TheM00s3

Reputation: 3711

Try doing this:

rdd.coalesce(1, shuffle = true).saveAsTextFile(...)

My understanding is that the shuffle = true argument lets the coalesce happen in parallel while still producing a single output file, but be careful with massive data files, since all of the data ends up in one partition.

Here are some more details on this issue at hand.
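
For the job in the question, a minimal sketch of this approach (assuming the same s3Bucket variable and output prefix as in the question) would be:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the question's job with coalesce(1, shuffle = true).
// The output prefix is still written as a "folder", but it should contain
// a single part-00000 file (typically plus a _SUCCESS marker) instead of many block files.
def createS3OutputFile() {
  val conf = new SparkConf().setAppName("Spark Pi")
  val spark = new SparkContext(conf)
  val outputFileUri = s"s3n://$s3Bucket/emr-output/test-3.txt"
  val rdd = spark.parallelize(Array("hello", "World", "!"))
  rdd.coalesce(1, shuffle = true).saveAsTextFile(outputFileUri)
  spark.stop()
}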

Upvotes: 5
