Reputation: 701
I have a Spark job that writes a huge output (about 300 GB) to S3. My requirement is to rename all the part files and then move them to a final folder.
I did some research but could not find a way to rename the Spark output files from within the Spark job itself.
So I came up with a plan: read the Spark output files back from S3, rename them, and write them back to an S3 folder.
The problem is that the Spark job itself takes 25 minutes to complete, but reading, renaming, and copying the files again in S3 takes 45 minutes.
This is very frustrating.
Is there any way I can make this process faster? The issue is that after the Spark job finishes, this step runs only on the core node, so it takes a very long time.
This is what I do:
// List every part file two partition levels below the output folder.
val file = fs.globStatus(new Path(outputFileURL + "/*/*/*"))
for (urlStatus <- file) {
  // The path contains two key=value partition directories; take the value of each.
  val DataPartitionName = urlStatus.getPath.toString.split("=")(1).split("\\/")(0)
  val StatementTypeCode = urlStatus.getPath.toString.split("=")(2).split("\\/")(0)
  val finalFileName = finalPrefix + DataPartitionName + "." + StatementTypeCode + "." + fileVersion + currentTime + fileExtention
  val dest = new Path(mainFileURL + "/" + finalFileName)
  // On S3 this rename is a copy followed by a delete.
  fs.rename(urlStatus.getPath, dest)
}
println("Files renamed and moved to the final directory; now deleting the output folder")
myUtil.Utility.DeleteOuptuFolder(fs, outputFileURL)
Is there any way to leverage either of the two options below?
The S3DistCp command? As far as I have researched, I could not find a way to rename files with S3DistCp. I am renaming the files based on their paths.
Can I use a shell command activity to read, rename, and copy the files?
Upvotes: 0
Views: 741
Reputation: 1130
The problem is that S3 rename is actually implemented as a copy-and-delete, so it will take longer if you have a lot of large files.
I would suggest writing to HDFS with Spark, then doing your filename manipulations on HDFS, where you actually have atomic rename semantics, then using S3DistCp to copy the now correctly named files to the destination locations, and finally deleting the files on HDFS if you need the space.
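As a rough sketch of the rename step, assuming the Spark job has already written its output to an HDFS staging folder laid out as two key=value partition directories above each part file, something like the following could work. The object name `RenameOnHdfs`, the folders `/tmp/spark-staging` and `/tmp/final`, the prefix and suffix, and the `part-*` glob are all placeholders for illustration, not values from the question. It also assumes one part file per leaf directory, since two files in the same directory would map to the same final name.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object RenameOnHdfs {
  // "DataPartition=Japan" -> "Japan"; a name without "=" is returned unchanged.
  def valueOf(dirName: String): String = dirName.split("=", 2).last

  // Builds the final file name from the two partition directory names.
  // prefix and suffix are hypothetical stand-ins for finalPrefix, fileVersion, etc.
  def finalName(partitionDir: String, typeDir: String,
                prefix: String, suffix: String): String =
    prefix + valueOf(partitionDir) + "." + valueOf(typeDir) + suffix

  // Renames every part file under stagingDir into finalDir.
  // Returns the number of renames that succeeded.
  def renameAll(fs: FileSystem, stagingDir: String, finalDir: String,
                prefix: String, suffix: String): Int = {
    fs.mkdirs(new Path(finalDir))
    // Guard against a null result if the staging folder does not exist.
    val parts: Array[FileStatus] =
      Option(fs.globStatus(new Path(stagingDir + "/*/*/part-*")))
        .getOrElse(Array.empty[FileStatus])
    parts.count { status =>
      val src = status.getPath
      val typeDir = src.getParent.getName
      val partitionDir = src.getParent.getParent.getName
      val dest = new Path(finalDir, finalName(partitionDir, typeDir, prefix, suffix))
      // On HDFS this is a metadata-only operation; it returns false on failure,
      // for example when dest already exists.
      fs.rename(src, dest)
    }
  }

  def main(args: Array[String]): Unit = {
    // On the cluster, the default file system from the configuration is HDFS.
    val fs = FileSystem.get(new Configuration())
    val renamed = renameAll(fs, "/tmp/spark-staging", "/tmp/final", "Report.", ".txt")
    println(s"Renamed $renamed files")
  }
}
```

Once the files carry their final names, the copy to S3 could be a single S3DistCp step on the EMR cluster, for example `s3-dist-cp --src hdfs:///tmp/final --dest s3://my-bucket/final`, where the bucket and folder names are again placeholders. That copy runs as a distributed job across the cluster instead of on a single node.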
Upvotes: 1