Atharv Thakur

Reputation: 701

Renaming and moving Spark output files in AWS taking a very long time

I have a Spark job whose output is huge, about 300 GB written to S3. My requirement is to rename all the part files and then move them to a final folder.

I did some research but could not find a way to rename the output files from within the Spark job itself.

So I came up with a plan: read the Spark output files back from S3, rename them, and then write them back to the S3 folder.

But here is the issue: my Spark job takes 25 minutes to complete, while reading, renaming, and copying back to S3 takes another 45 minutes.

This is so frustrating for me.

Is there any way I can make this process faster? The issue is that after the Spark job finishes, this process runs only on the core node, so it takes a very long time.

This is what I do:

    val files = fs.globStatus(new Path(outputFileURL + "/*/*/*"))
    for (urlStatus <- files) {
      // The partition values are embedded in the path as key=value segments,
      // e.g. .../DataPartition=<name>/StatementTypeCode=<code>/part-00000
      val dataPartitionName = urlStatus.getPath.toString.split("=")(1).split("/")(0)
      val statementTypeCode = urlStatus.getPath.toString.split("=")(2).split("/")(0)

      // Build the final file name and move (rename) the part file into place.
      val finalFileName = finalPrefix + dataPartitionName + "." + statementTypeCode + "." + fileVersion + currentTime + fileExtention
      val dest = new Path(mainFileURL + "/" + finalFileName)
      fs.rename(urlStatus.getPath, dest)
    }
    println("Files renamed and moved to the final dir, now delete the output folder")
    myUtil.Utility.DeleteOuptuFolder(fs, outputFileURL)
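One idea I am experimenting with is issuing the renames concurrently instead of in a serial loop, since each rename is an independent S3 call. Below is an untested sketch that reuses the same fs and naming variables as above; the pool size of 32 is an arbitrary guess, and I am assuming the shared FileSystem instance is safe to use from multiple threads.

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.hadoop.fs.Path

    // Untested sketch: fan the renames out over a fixed thread pool so the
    // driver is not waiting on one S3 round-trip at a time.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))

    val renames = fs.globStatus(new Path(outputFileURL + "/*/*/*")).toSeq.map { urlStatus =>
      Future {
        val dataPartitionName = urlStatus.getPath.toString.split("=")(1).split("/")(0)
        val statementTypeCode = urlStatus.getPath.toString.split("=")(2).split("/")(0)
        val finalFileName = finalPrefix + dataPartitionName + "." + statementTypeCode + "." + fileVersion + currentTime + fileExtention
        fs.rename(urlStatus.getPath, new Path(mainFileURL + "/" + finalFileName))
      }
    }
    // Wait for every rename to finish before cleaning up the output folder.
    Await.result(Future.sequence(renames), Duration.Inf)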

Is there any way to leverage either of the two options below?

  1. The S3DistCp command? As far as I have researched, I did not find a way to rename files with S3DistCp. My renaming is based on the file's path.

  2. Can I use a shell command activity to read, rename, and copy? (Rough sketch of what I mean below.)
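For option 2, this is roughly what I have in mind; an untested sketch that shells out to the AWS CLI from Scala, with made-up example keys. As I understand it, aws s3 mv is itself a copy followed by a delete, so any speedup would have to come from running many of these in parallel.

    import scala.sys.process._

    // Untested sketch of option 2: the keys below are made-up examples; the
    // real ones would come from the same globStatus listing as above.
    val src = "s3://my-bucket/output/DataPartition=GB/StatementTypeCode=BAL/part-00000"
    val dst = "s3://my-bucket/final/GB.BAL." + fileVersion + currentTime + fileExtention

    // "aws s3 mv" performs a server-side copy plus a delete, so one call
    // costs about the same as one fs.rename.
    val exitCode = Seq("aws", "s3", "mv", src, dst).!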

Upvotes: 0

Views: 741

Answers (1)

Simplefish

Reputation: 1130

The problem is that S3 rename is actually implemented as a copy-and-delete, so it will take longer if you have a lot of large files.

I would suggest writing to HDFS with Spark, doing your filename manipulations locally on HDFS (where you actually have atomic rename semantics), then using S3DistCp to copy the now correctly named files to their destination, and finally deleting the files on HDFS if you need the space back.
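A rough, untested sketch of that flow is below. The paths and bucket name are placeholders, df stands for whatever DataFrame your job produces, and buildFinalName stands in for the naming logic from your question.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.sys.process._

    // 1. Write the job output to HDFS instead of S3 (df = your DataFrame).
    df.write.csv("hdfs:///tmp/job-output")

    // 2. Rename locally on HDFS, where rename is an atomic metadata
    //    operation instead of a copy-and-delete.
    val hdfs = FileSystem.get(new URI("hdfs:///"), new Configuration())
    hdfs.mkdirs(new Path("/tmp/job-renamed"))
    for (status <- hdfs.globStatus(new Path("/tmp/job-output/*/*/*"))) {
      val finalName = buildFinalName(status.getPath) // your naming logic from the question
      hdfs.rename(status.getPath, new Path("/tmp/job-renamed/" + finalName))
    }

    // 3. Copy the correctly named files to S3 in one parallel pass (EMR's
    //    s3-dist-cp distributes the copy across the cluster).
    Seq("s3-dist-cp", "--src", "hdfs:///tmp/job-renamed",
        "--dest", "s3://your-bucket/final-folder").!

    // 4. Reclaim the HDFS space once the copy has succeeded.
    hdfs.delete(new Path("/tmp/job-output"), true)
    hdfs.delete(new Path("/tmp/job-renamed"), true)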

Upvotes: 1
