Reputation: 11
When I use saveAsTextFile like this,
rdd1.saveAsTextFile("../savefile")
rdd2.saveAsTextFile("../savefile")
I can't put two different RDDs into a single text file. Is there a way I can do so?
Besides, is there a way I can apply some format to the text I am writing to the text file? For example, add a \n or some other format.
Upvotes: 1
Views: 1444
Reputation: 330093
The notion of a single text file is rather ambiguous in Spark. Each partition is saved individually, which means you get one file per partition. If you want a single file for an RDD, you have to move your data to a single partition or collect it, and most of the time that is either too expensive or simply not feasible.
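To make the one-file-per-partition behaviour concrete, here is a minimal sketch. It assumes a local Spark installation; the object name, the app name, and the output directories `out-many` and `out-single` are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OneFilePerPartition {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setAppName("one-file-per-partition").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Explicitly ask for 4 partitions.
      val rdd = sc.parallelize(1 to 100, numSlices = 4)
      println(rdd.getNumPartitions) // 4

      // Writes a directory containing part-00000 ... part-00003.
      rdd.saveAsTextFile("out-many")

      // Moves everything into one partition first, so the directory
      // contains a single part-00000. Only sensible for small data.
      rdd.coalesce(1).saveAsTextFile("out-single")
    } finally {
      sc.stop()
    }
  }
}
```

Note that the path passed to saveAsTextFile names a directory, not a file, and the call fails if that directory already exists.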
You can get a union of RDDs using the union method (or ++, as mentioned by lpiepiora in the comments), but it works only if both RDDs are of the same type:
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(Seq("a", "b", "c", "d", "e"))
rdd1.union(rdd2)
// <console>:26: error: type mismatch;
// found : org.apache.spark.rdd.RDD[String]
// required: org.apache.spark.rdd.RDD[Int]
// rdd1.union(rdd2)
If the types are different, the whole idea smells fishy though.
If you want a specific format you have to apply it before calling saveAsTextFile. saveAsTextFile simply calls toString on each element.
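As an illustration of formatting before saving, the sketch below maps each element to a String first. The `formatPair` helper, the tab-separated layout, and the output path are assumptions made up for this example, not something from the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FormatBeforeSave {
  // Hypothetical formatter: one tab-separated line per (key, value) pair.
  def formatPair(kv: (String, Int)): String = s"${kv._1}\t${kv._2}"

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setAppName("format-before-save").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

      // Without the map, the tuple's own toString would be written, e.g. "(a,1)".
      val lines: RDD[String] = pairs.map(formatPair)

      // Hypothetical output directory; it must not already exist.
      lines.saveAsTextFile("formatted-output")
    } finally {
      sc.stop()
    }
  }
}
```

Each element is written on its own line, so there is no need to append \n yourself; a newline embedded in an element simply spreads that element over several lines of output.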
Putting all of the above together:
import org.apache.spark.rdd.RDD
val rddStr1: RDD[String] = rdd1.map(x => ???) // Map to RDD[String]
val rddStr2: RDD[String] = rdd2.map(x => ???)
rddStr1.union(rddStr2) // Union the String RDDs, not the originals
.repartition(1) // Not recommended!
.saveAsTextFile(some_path)
Upvotes: 1