Atharv Thakur

Reputation: 701

Renaming and moving Spark output files in AWS taking a very long time

I have a Spark job whose output is huge, about 300 GB written to S3. My requirement is to rename all the part files and then move them to a final folder.

I did some research but could not find a way to rename the output files from within the Spark job itself.

So I came up with a plan: read the Spark output files back from S3, rename them, and then write them back to the S3 folder.

But here is the issue: my Spark job takes 25 minutes to complete, while reading, renaming, and copying back to S3 takes another 45 minutes.

This is so frustrating for me.

Is there any way I can make this process faster? The issue is that after the Spark job finishes, this process runs only on the core node, so it takes a very long time.

This is what I do:

    val files = fs.globStatus(new Path(outputFileURL + "/*/*/*"))
    for (urlStatus <- files) {
      // The partition values are embedded in the path as key=value segments,
      // e.g. .../DataPartition=<name>/StatementTypeCode=<code>/part-00000
      val dataPartitionName = urlStatus.getPath.toString.split("=")(1).split("/")(0)
      val statementTypeCode = urlStatus.getPath.toString.split("=")(2).split("/")(0)

      // Build the final file name and move (rename) the part file into place.
      val finalFileName = finalPrefix + dataPartitionName + "." + statementTypeCode + "." + fileVersion + currentTime + fileExtention
      val dest = new Path(mainFileURL + "/" + finalFileName)
      fs.rename(urlStatus.getPath, dest)
    }
    println("Files renamed and moved to the final dir, now delete the output folder")
    myUtil.Utility.DeleteOuptuFolder(fs, outputFileURL)
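One idea I am experimenting with is issuing the renames concurrently instead of in a serial loop, since each rename is an independent S3 call. Below is an untested sketch that reuses the same fs and naming variables as above; the pool size of 32 is an arbitrary guess, and I am assuming the shared FileSystem instance is safe to use from multiple threads.

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.hadoop.fs.Path

    // Untested sketch: fan the renames out over a fixed thread pool so the
    // driver is not waiting on one S3 round-trip at a time.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))

    val renames = fs.globStatus(new Path(outputFileURL + "/*/*/*")).toSeq.map { urlStatus =>
      Future {
        val dataPartitionName = urlStatus.getPath.toString.split("=")(1).split("/")(0)
        val statementTypeCode = urlStatus.getPath.toString.split("=")(2).split("/")(0)
        val finalFileName = finalPrefix + dataPartitionName + "." + statementTypeCode + "." + fileVersion + currentTime + fileExtention
        fs.rename(urlStatus.getPath, new Path(mainFileURL + "/" + finalFileName))
      }
    }
    // Wait for every rename to finish before cleaning up the output folder.
    Await.result(Future.sequence(renames), Duration.Inf)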

Is there any way to leverage either of the two options below?

  1. The S3DistCp command? As far as I have researched, I did not find a way to rename files with S3DistCp. My renaming is based on the file's path.

  2. Can I use a shell command activity to read, rename, and copy? (Rough sketch of what I mean below.)
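For option 2, this is roughly what I have in mind; an untested sketch that shells out to the AWS CLI from Scala, with made-up example keys. As I understand it, aws s3 mv is itself a copy followed by a delete, so any speedup would have to come from running many of these in parallel.

    import scala.sys.process._

    // Untested sketch of option 2: the keys below are made-up examples; the
    // real ones would come from the same globStatus listing as above.
    val src = "s3://my-bucket/output/DataPartition=GB/StatementTypeCode=BAL/part-00000"
    val dst = "s3://my-bucket/final/GB.BAL." + fileVersion + currentTime + fileExtention

    // "aws s3 mv" performs a server-side copy plus a delete, so one call
    // costs about the same as one fs.rename.
    val exitCode = Seq("aws", "s3", "mv", src, dst).!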

Upvotes: 0

Views: 741

Answers (1)

Simplefish

Reputation: 1130

The problem is that S3 rename is actually implemented as a copy-and-delete, so it will take longer if you have a lot of large files.

I would suggest writing to HDFS with Spark, doing your filename manipulations locally on HDFS (where you actually have atomic rename semantics), then using S3DistCp to copy the now correctly named files to their destination, and finally deleting the files on HDFS if you need the space back.
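A rough, untested sketch of that flow is below. The paths and bucket name are placeholders, df stands for whatever DataFrame your job produces, and buildFinalName stands in for the naming logic from your question.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.sys.process._

    // 1. Write the job output to HDFS instead of S3 (df = your DataFrame).
    df.write.csv("hdfs:///tmp/job-output")

    // 2. Rename locally on HDFS, where rename is an atomic metadata
    //    operation instead of a copy-and-delete.
    val hdfs = FileSystem.get(new URI("hdfs:///"), new Configuration())
    hdfs.mkdirs(new Path("/tmp/job-renamed"))
    for (status <- hdfs.globStatus(new Path("/tmp/job-output/*/*/*"))) {
      val finalName = buildFinalName(status.getPath) // your naming logic from the question
      hdfs.rename(status.getPath, new Path("/tmp/job-renamed/" + finalName))
    }

    // 3. Copy the correctly named files to S3 in one parallel pass (EMR's
    //    s3-dist-cp distributes the copy across the cluster).
    Seq("s3-dist-cp", "--src", "hdfs:///tmp/job-renamed",
        "--dest", "s3://your-bucket/final-folder").!

    // 4. Reclaim the HDFS space once the copy has succeeded.
    hdfs.delete(new Path("/tmp/job-output"), true)
    hdfs.delete(new Path("/tmp/job-renamed"), true)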

Upvotes: 1
