manie

Reputation: 355

Append/concatenate two files using Spark/Scala

I have multiple files stored in HDFS, and I need to merge them into one file using Spark. However, because this operation is performed frequently (every hour), I need to append those multiple files to the source file.

I found that FileUtil provides the 'copyMerge' function, but it doesn't allow appending two files.
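For reference, a minimal sketch of how copyMerge is called (assuming Hadoop 2.x, where the API is available; the paths are just examples):

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

 val conf = new Configuration()
 val fs = FileSystem.get(conf)

 // Merges every file under the input directory into one destination file,
 // but it always rewrites the destination from scratch; there is no append option.
 FileUtil.copyMerge(fs, new Path("path/input-dir"), fs, new Path("path/merged"), false, conf, null)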

Thank you for your help.

Upvotes: 5

Views: 6014

Answers (1)

Mehrez

Reputation: 695

You can do this with two methods:

 sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")
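Note that textFile takes a single comma-separated string of paths (globs work too), not multiple path arguments, so several files can be read in one call.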

Or, as @Pushkr has proposed:

 import org.apache.spark.rdd.UnionRDD

 new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"), ...)).coalesce(1).saveAsTextFile("path/newSource")

If you don't want to create a new source but would rather overwrite the same source every hour, you can use DataFrames with save mode Overwrite (see How to overwrite the output directory in spark).
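A minimal sketch of that approach, assuming Spark 2.x and plain-text files; all paths are examples. Keep in mind that Spark reads input lazily, so a job should not overwrite a directory it is also reading from, which is why this sketch writes to a separate output path:

 import org.apache.spark.sql.{SaveMode, SparkSession}

 val spark = SparkSession.builder().appName("HourlyMerge").getOrCreate()

 // Read the current source together with the new hourly files.
 val merged = spark.read.textFile("path/source", "path/file1", "path/file2")

 // Write everything back as a single file. SaveMode.Overwrite replaces the
 // target directory if it already exists, so the same output path can be
 // reused every hour. Do not point it at a path that is also read in this job.
 merged.coalesce(1)
   .write
   .mode(SaveMode.Overwrite)
   .text("path/newSource")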

Upvotes: 2
