Reputation: 470
In my project, I have three input files whose names are passed as args(0) to args(2), and an output file name passed as args(3). In the source code, I use
val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))
I do nothing to log except save it as a text file using
log.coalesce(1, true).saveAsTextFile(args(args.size - 1))
but it still saves to 3 files: part-00000, part-00001, and part-00002. Is there any way to save the three input files to a single output file?
Upvotes: 7
Views: 15357
Reputation: 21499
As mentioned, your problem is somewhat unavoidable with the standard APIs, since they assume that you are dealing with large quantities of data. However, if your data is manageable, you could try the following:
import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))
What I am doing here is converting the RDD into a String by performing a collect and then mkString. I would not suggest doing this in production, because collect pulls the whole RDD into the driver's memory. It works fine for local data analysis (I was working with about 5 GB of local data).
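If holding the whole string in memory is a concern, one variation is to stream the lines to the file one at a time. This is only a minimal sketch and not part of the approach above: `writeLines` is a hypothetical helper name, and with an RDD you would pass `data.toLocalIterator` (assuming `data` is an RDD[String]), which brings one partition at a time to the driver. A plain iterator stands in for it here so the snippet runs without a cluster.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper: writes each line followed by a line separator,
// without first building one big String in memory.
def writeLines(lines: Iterator[String], target: Path): Unit = {
  val writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)
  try {
    lines.foreach { line =>
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writer.close()
  }
}

// Stand-in for data.toLocalIterator (assumption: data is an RDD[String]).
val sampleOut = Files.createTempFile("single_output", ".txt")
writeLines(Iterator("first", "second", "third"), sampleOut)
```

The data still passes through the driver, so this remains a small-to-medium data technique rather than a production one.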
Upvotes: 0
Reputation: 2444
Having multiple output files is standard behavior for multi-machine clusters like Hadoop or Spark. The number of output files depends on the number of reducers (in Spark, on the number of partitions of the RDD being saved).
How to "solve" it in Hadoop: merge output files after reduce phase
How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple file?
More good information is available here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his remark), your code works if you run it locally.
What you might try is to read the files first and then save the output.
...
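val sc = new SparkContext()
// Gather the lines on the driver. A String mutated inside foreach would not
// be updated on a cluster, because foreach runs on the executors.
val lines = scala.collection.mutable.ArrayBuffer.empty[String]
for (i <- 0 until args.size - 1) {
  val file = sc.textFile(args(i))
  lines ++= file.collect() // collect() brings this file's lines to the driver
}
// and now you might save the content as a single partition
sc.parallelize(lines.toList, 1).saveAsTextFile("out")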
Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files; I would process the multiple output files instead.
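To process the part files as they are, something like the following sketch could work. It assumes the job wrote its output to a directory on the local filesystem (not HDFS) and that the files follow the part-NNNNN naming seen in the question. `partFiles` and `allLines` are hypothetical helper names, and the temporary directory only stands in for a real output directory.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

// Hypothetical helper: the part-* files of an output directory, in name order.
// Marker files such as _SUCCESS and hidden .crc files are skipped.
def partFiles(outputDir: Path): List[File] = {
  val files = Option(outputDir.toFile.listFiles()).getOrElse(Array.empty[File])
  files.filter(_.getName.startsWith("part-")).sortBy(_.getName).toList
}

// Hypothetical helper: every line of every part file, read one file at a time.
def allLines(outputDir: Path): Iterator[String] =
  partFiles(outputDir).iterator.flatMap { f =>
    val src = Source.fromFile(f, "UTF-8")
    try src.getLines().toList finally src.close()
  }

// Demo with a throwaway directory standing in for the job's output directory.
val demoDir = Files.createTempDirectory("out")
Files.write(demoDir.resolve("part-00001"), "c\n".getBytes(StandardCharsets.UTF_8))
Files.write(demoDir.resolve("part-00000"), "a\nb\n".getBytes(StandardCharsets.UTF_8))
Files.write(demoDir.resolve("_SUCCESS"), Array.empty[Byte])
val merged = allLines(demoDir).toList // List("a", "b", "c")
```

If a single file is really needed afterwards, the lines from allLines could be written out with Files.write, but downstream code that accepts a directory of part files avoids that extra step.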
Upvotes: 2