Reputation: 810
I'm very new to Spark and I am using Spark 1.6.0.
I have an RDD of pairs: RDD[(Array[Array[String]], Long)]
I would like to run each element of the RDD through a function that takes an (Array[Array[String]], Long) pair as input and returns a ListBuffer[Array[Int]] as output.
The computation for each element of the RDD can be done in parallel; the elements do not depend on each other. However, once all of the elements have been run through the function, I would like to join all of the ListBuffer[Array[Int]] outputs into one single ListBuffer[Array[Int]] (order is not relevant here either, but they should all end up in the same data structure).
What would be the best way to do this? I could foreach over the RDD and run each element through the function, but then I'm not sure how to handle the outputs and perform the merge at the driver.
This seems to be possible with Accumulators. The function mentioned earlier is not just a single line of code; it is something like 20+ lines. So if we have this function:
def func(data: (Array[Array[String]], Long)): ListBuffer[Array[Int]] = {
  // create a ListBuffer
  // iterate over data
  // do some operations on an element of data
  // add some entry to the ListBuffer
  // add the ListBuffer to the Accumulator, or return the ListBuffer?
}
How could I wrap this all together? Can I do something like:
// create an Accumulator
// rdd.foreach(), calling func and passing the Accumulator as an argument?
Or:
val accum = sc.accumulableCollection(ListBuffer.empty[Array[Int]]) // a ListBuffer[Array[Int]] accumulator
rdd.foreach(x => accum ++= func(x))
Upvotes: 1
Views: 1570
Reputation: 74729
TL;DR Use map followed by collect if you need the results on the driver.
map applies a function to every element in an RDD.
map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to all elements of this RDD.
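For instance, a minimal sketch, assuming sc is your SparkContext:
// double every element; the function runs on the executors
val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2)
doubled.collect() // Array(2, 4, 6)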
In your case, T is (Array[Array[String]], Long) and U is ListBuffer[Array[Int]]. If you have a function f that does the transformation of elements of type T, map is your friend.
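Putting it together, a minimal sketch; rdd and func here stand in for your RDD[(Array[Array[String]], Long)] and the 20-line function from your question:
import scala.collection.mutable.ListBuffer

// run func on every element in parallel on the executors,
// then bring the per-element ListBuffers back to the driver
val buffers = rdd.map(func).collect() // Array[ListBuffer[Array[Int]]]

// merge everything into one ListBuffer on the driver
// (order is irrelevant, as you noted)
val merged = buffers.foldLeft(ListBuffer.empty[Array[Int]])(_ ++= _)

If you never need the intermediate buffers, rdd.flatMap(func).collect() gives you all of the Array[Int] elements in one step.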
Upvotes: 1