Ryan

Reputation: 1232

Spark writing compressed CSV with custom path to S3

I'm simply trying to write a CSV to S3 using Spark in Scala:

I notice in my output bucket the following file: ...PROCESSED/montfh-04.csv/part-00000-723a3d72-56f6-4e62-b627-9a181a820f6a-c000.csv.snappy

when it should only be montfh-04.csv

Code:

    import spark.implicits._  // needed for .toDF() on a local Seq

    val processedMetadataDf = spark.read.csv("s3://" + metadataPath + "/PROCESSED/" + "month-04" + ".csv")
    val processCount = processedMetadataDf.count()
    if (processCount == 0) {
        // Initial frame is 0B -> overwrite with a single dummy row
        val newDat = Seq("dummy-row-data")
        val unknown_df = newDat.toDF()
        unknown_df.write.mode("overwrite").option("header", "false").csv("s3://" + metadataPath + "/PROCESSED/" + "montfh-04" + ".csv")
    }

Here I notice two strange things:

- Spark wrote a *directory* named montfh-04.csv containing a part file, rather than a single file with that name
- the output is snappy-compressed, even though I never asked for compression

All I am trying to do is write a flat CSV file with that name to the specified path. What are my options?

Upvotes: 0

Views: 411

Answers (1)

Akshit Methi

Reputation: 21

This is how Spark works. The location you provide when saving a Dataset/DataFrame is a directory into which Spark writes all of its partitions. The number of part files equals the number of partitions, which in your case is 1.

Now, if you want the file to be named montfh-04.csv exactly, you can rename it after the write.
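
For example, a minimal sketch of that approach, building on the question's variables (unknown_df, metadataPath, spark): force a single partition, disable compression so the output is plain CSV, write to a temporary directory, then rename the single part file with the Hadoop FileSystem API. The temporary directory name montfh-04-tmp is made up for illustration:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Force one partition and disable compression so the output is plain CSV.
    unknown_df.coalesce(1)
        .write.mode("overwrite")
        .option("header", "false")
        .option("compression", "none")
        .csv("s3://" + metadataPath + "/PROCESSED/montfh-04-tmp")

    // Find the single part file Spark produced and rename it to the desired key.
    val fs = FileSystem.get(new URI("s3://" + metadataPath), spark.sparkContext.hadoopConfiguration)
    val partFile = fs.globStatus(new Path("s3://" + metadataPath + "/PROCESSED/montfh-04-tmp/part-*.csv"))(0).getPath
    fs.rename(partFile, new Path("s3://" + metadataPath + "/PROCESSED/montfh-04.csv"))
    fs.delete(new Path("s3://" + metadataPath + "/PROCESSED/montfh-04-tmp"), true)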

Note: renaming in S3 is a costly operation (a copy followed by a delete). Since you are writing with Spark, you pay roughly three times the I/O: two for the output commit operation (which itself copies the data from a temporary location) and one more for the rename. It is better to write the file to HDFS and upload it from there under the required key name.
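
A sketch of that HDFS-first approach, assuming a staging directory on HDFS (the hdfs:///tmp/montfh-04-staging path is hypothetical) and using Hadoop's FileUtil.copy to push the part file to S3 under the exact key:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // Hypothetical staging directory on HDFS; adjust to your cluster layout.
    val stagingDir = "hdfs:///tmp/montfh-04-staging"

    // Write a single uncompressed part file to HDFS first (cheap rename semantics).
    unknown_df.coalesce(1)
        .write.mode("overwrite")
        .option("header", "false")
        .option("compression", "none")
        .csv(stagingDir)

    val conf = spark.sparkContext.hadoopConfiguration
    val hdfs = FileSystem.get(new URI(stagingDir), conf)
    val s3 = FileSystem.get(new URI("s3://" + metadataPath), conf)

    // Copy the single part file to S3 under the exact key, then remove the staging dir.
    val partFile = hdfs.globStatus(new Path(stagingDir + "/part-*.csv"))(0).getPath
    FileUtil.copy(hdfs, partFile, s3, new Path("s3://" + metadataPath + "/PROCESSED/montfh-04.csv"), true, conf)
    hdfs.delete(new Path(stagingDir), true)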

Upvotes: 2
