Bharath

Reputation: 477

Combine output of parallel operations using Scala

The following snippet of code processes the filters in parallel and writes individual files to the output directory. Is there a way to get one large output file?

Array(
  (filter1, outputPathBase + fileName),
  (filter2, outputPathBase + fileName),
  (filter3, outputPathBase + fileName)
).par.foreach {
  case (extract, path) => extract.coalesce(1).write.mode("append").csv(path)
}

Thank you.

Upvotes: 0

Views: 171

Answers (1)

Mikel San Vicente

Reputation: 3863

You can reduce the array into a single RDD by unioning its elements; Spark will then parallelize the execution of each filter*:

val rdd = Array(
  filter1,
  filter2,
  filter3).reduce(_.union(_))

rdd.write.mode("append").csv(path)

There is no need in this case to convert the Array to a ParArray.

I am assuming that filter1, filter2, and filter3 all have the same type, RDD[T].
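To illustrate the reduce-union shape without a Spark cluster, here is a minimal sketch using plain Scala Seq values in place of RDDs (Seq also has a union method). The filter1/filter2/filter3 values below are hypothetical stand-ins, not your actual data:

```scala
// Sketch of the reduce-union pattern on plain collections.
// Each Seq stands in for one of the filtered RDDs; reduce combines
// them pairwise with union, exactly as _.union(_) does for RDDs.
object ReduceUnionSketch {
  def main(args: Array[String]): Unit = {
    val filter1 = Seq("a", "b") // hypothetical filter outputs
    val filter2 = Seq("c")
    val filter3 = Seq("d", "e")

    // Same shape as the answer's Array(...).reduce(_.union(_))
    val combined = Array(filter1, filter2, filter3).reduce(_.union(_))

    println(combined.mkString(",")) // a,b,c,d,e
  }
}
```

With real RDDs the single combined RDD is then written once, so Spark produces one output directory instead of three separate ones.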

Upvotes: 1
