Bharath

Reputation: 477

Combine output of parallel operations using Scala

The following snippet of code processes the filters in parallel and writes individual files to the output directory. Is there a way to get one large output file?

Array(
  (filter1, outputPathBase + fileName),
  (filter2, outputPathBase + fileName),
  (filter3, outputPathBase + fileName)
).par.foreach {
  case (extract, path) => extract.coalesce(1).write.mode("append").csv(path)
}

Thank you.

Upvotes: 0

Views: 171

Answers (1)

Mikel San Vicente

Reputation: 3863

You can reduce the array into a single RDD by unioning its elements; Spark will then parallelize the execution of each filter*:

val rdd = Array(
  filter1,
  filter2,
  filter3).reduce(_.union(_))

rdd.write.mode("append").csv(path)

There is no need in this case to convert the Array to a ParArray.

I am assuming that filter1, filter2, and filter3 all have the same type, RDD[T].
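To illustrate the reduce-union shape without a Spark cluster, here is a minimal sketch using plain Scala Seq values in place of RDDs (Seq also has a union method). The filter1/filter2/filter3 values below are hypothetical stand-ins, not your actual data:

```scala
// Sketch of the reduce-union pattern on plain collections.
// Each Seq stands in for one of the filtered RDDs; reduce combines
// them pairwise with union, exactly as _.union(_) does for RDDs.
object ReduceUnionSketch {
  def main(args: Array[String]): Unit = {
    val filter1 = Seq("a", "b") // hypothetical filter outputs
    val filter2 = Seq("c")
    val filter3 = Seq("d", "e")

    // Same shape as the answer's Array(...).reduce(_.union(_))
    val combined = Array(filter1, filter2, filter3).reduce(_.union(_))

    println(combined.mkString(",")) // a,b,c,d,e
  }
}
```

With real RDDs the single combined RDD is then written once, so Spark produces one output directory instead of three separate ones.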

Upvotes: 1
