Reputation: 121
I've been struggling with this problem for almost 4 days and I cannot find an efficient solution.
I have an RDD in Spark of the form RDD[(Int, (Date, Double))]
(the first value is just an index).
What do you think is the most efficient way in Spark to get an output RDD where each element is some function applied to every window of n contiguous elements of the input RDD?
For instance, with the average as the function and n = 5, the result should be:
input: [1.0, 2.0, 3.0, 2.0, 6.0, 4.0, 3.0, 4.0, 3.0, 2.0]
output: [ 2.8, 3.4, 3.6, 3.8, 4.0, 3.2]
Because:
(1.0 + 2.0 + 3.0 + 2.0 + 6.0) / 5 = 14.0 / 5 = 2.8
(2.0 + 3.0 + 2.0 + 6.0 + 4.0) / 5 = 17.0 / 5 = 3.4
(3.0 + 2.0 + 6.0 + 4.0 + 3.0) / 5 = 18.0 / 5 = 3.6
(2.0 + 6.0 + 4.0 + 3.0 + 4.0) / 5 = 19.0 / 5 = 3.8
(6.0 + 4.0 + 3.0 + 4.0 + 3.0) / 5 = 20.0 / 5 = 4.0
(4.0 + 3.0 + 4.0 + 3.0 + 2.0) / 5 = 16.0 / 5 = 3.2
This would be a very easy problem to solve outside of Spark, but I'm very new to Scala and Spark and I don't know what the best practice is in this case.
I tried a lot of solutions, including a kind of nested map() approach, but of course Spark does not allow nested RDD operations. Some of them work but are not very efficient.
What do you think is the best algorithmic way to solve this problem in Spark with Scala?
Upvotes: 0
Views: 34
Reputation: 1316
You can use mllib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0, 2.0, 6.0, 4.0, 3.0, 4.0, 3.0, 2.0))
def average(x: Array[Double]) = x.sum / x.length
rdd.sliding(5).map(average).collect.mkString(", ") // 2.8, 3.4, 3.6, 3.8, 4.0, 3.2
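If you want to keep the keyed shape from the question, RDD[(Int, (Date, Double))], here is a minimal sketch (assuming the Int index defines the ordering and the mean is the window function; the dates are made up just to keep the example self-contained): sort by the index, project out the Double, then slide over the values.
import org.apache.spark.mllib.rdd.RDDFunctions._
val values = Seq(1.0, 2.0, 3.0, 2.0, 6.0, 4.0, 3.0, 4.0, 3.0, 2.0)
// Hypothetical keyed input shaped like the question's RDD[(Int, (Date, Double))]
val keyed = sc.parallelize(values.zipWithIndex.map { case (v, i) =>
  (i, (java.sql.Date.valueOf(f"2017-01-${i + 1}%02d"), v))
})
val averages = keyed
  .sortBy(_._1)                          // order by the index so windows are contiguous
  .map { case (_, (_, v)) => v }         // keep only the Double values
  .sliding(5)                            // windows of 5 contiguous values
  .map(w => w.sum / w.length)            // the aggregate; here the mean
averages.collect.mkString(", ")          // 2.8, 3.4, 3.6, 3.8, 4.0, 3.2
sliding takes care of windows that cross partition boundaries for you, so there is no need to shuffle neighbouring elements around by hand.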
Upvotes: 2