yjgong

Reputation: 11

Save two or more different RDDs in a single text file in scala

When I use saveAsTextFile like,

rdd1.saveAsTextFile("../savefile")
rdd2.saveAsTextFile("../savefile")

I can't put two different RDDs into a single text file. Is there a way I can do so?

Besides, is there a way I can apply some format to the text I am writing to the text file? For example, add a \n or some other format.

Upvotes: 1

Views: 1444

Answers (1)

zero323

Reputation: 330093

  1. "A single text file" is rather ambiguous in Spark. Each partition is saved individually, so you get one file per partition. If you want a single file for an RDD, you have to move your data to a single partition or collect it, and most of the time that is either too expensive or simply not feasible.

  2. You can get a union of RDDs using the union method (or ++, as mentioned by lpiepiora in the comments), but it works only if both RDDs are of the same type:

    val rdd1 = sc.parallelize(1 to 5)
    val rdd2 = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    rdd1.union(rdd2)
    
    // <console>:26: error: type mismatch;
    //  found   : org.apache.spark.rdd.RDD[String]
    //  required: org.apache.spark.rdd.RDD[Int]
    //               rdd1.union(rdd2)
    

    If the types are different, the whole idea smells fishy though.

  3. If you want a specific format you have to apply it before calling saveAsTextFile. saveAsTextFile simply calls toString on each element.
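As a minimal sketch of point 3, formatting can live in plain functions that turn each element into the exact string you want written. The `Reading` type, the function names, and both layouts below are hypothetical and chosen only for illustration.

```scala
// Hypothetical record type and formatters; names and layouts are illustrative only.
final case class Reading(sensor: String, value: Double)

// One output line per record: tab-separated fields.
def toTsv(r: Reading): String = s"${r.sensor}\t${r.value}"

// Several output lines per record: the embedded "\n" is written as-is,
// so this record spans two lines in the saved file.
def toBlock(r: Reading): String = s"sensor: ${r.sensor}\nvalue: ${r.value}"
```

With an `RDD[Reading]` called `readings` (also an assumption), `readings.map(toTsv).saveAsTextFile(path)` would write one tab-separated line per element, since each element's string is followed by a line break in the output.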

Putting all of the above together:

import org.apache.spark.rdd.RDD

val rddStr1: RDD[String] = rdd1.map(x => ???) // Map to RDD[String]
val rddStr2: RDD[String] = rdd2.map(x => ???)

rddStr1.union(rddStr2)
  .repartition(1) // Not recommended!
  .saveAsTextFile(some_path)
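
For reference, here is a self-contained sketch of the same steps that can be run outside spark-shell. The local master, the app name, the `int:`/`str:` prefixes, and the temp-directory output location are all assumptions made for the example. It uses coalesce(1) rather than repartition(1), which avoids a full shuffle but carries the same caveat: all data passes through a single partition.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveUnionExample {
  // Writes both RDDs into one output directory and returns that directory.
  def run(): Path = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("save-union-example")
    val sc = new SparkContext(conf)
    try {
      val rdd1: RDD[Int] = sc.parallelize(1 to 5)
      val rdd2: RDD[String] = sc.parallelize(Seq("a", "b", "c", "d", "e"))

      // Bring both RDDs to RDD[String]; the prefixes are an arbitrary format.
      val rddStr1: RDD[String] = rdd1.map(x => s"int:$x")
      val rddStr2: RDD[String] = rdd2.map(x => s"str:$x")

      // saveAsTextFile fails if the path already exists,
      // so target a fresh child of a new temp directory.
      val outputDir = Files.createTempDirectory("spark-out").resolve("savefile")

      rddStr1.union(rddStr2)
        .coalesce(1) // one partition, hence one part file; not recommended for large data
        .saveAsTextFile(outputDir.toString)

      outputDir
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"Wrote ${run()}")
}
```

Note that the path passed to saveAsTextFile is a directory, not a file: the single "text file" ends up as part-00000 inside it.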

Upvotes: 1
