3xCh1_23

Reputation: 1499

How to carry over a calculated value within an RDD? - Apache Spark

SOLVED: There is no good solution to this problem

I am sure this is just a syntax question and that the answer is an easy one.


What I am trying to achieve is to:

- pass a variable to the RDD

- change the variable according to the RDD data

- get the adjusted variable back


Let's say I have:

var b = 2

val x = sc.parallelize(0 to 3)

What I want to do is obtain the value (2+0) + (2+0+1) + (2+0+1+2) + (2+0+1+2+3) = 18. That is, I want to compute the value 18 by doing something like

b = x.map(i => … b + i …).collect

The problem is that, for each i, I need to carry over the value of b so it can be incremented with the next i.

I want to use this logic to add the elements to an array that is external to the RDD.

How would I do that without doing the collect first?

Upvotes: 0

Views: 147

Answers (1)

maasg

Reputation: 37435

As mentioned in the comments, it's not possible to mutate a variable with the contents of an RDD, as RDDs are distributed across potentially many different nodes while mutable variables are local to each executor (JVM).
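To see why (an editorial sketch, not part of the original answer, assuming a running SparkContext named sc): the closure passed to an RDD operation is serialized and shipped to the executors, so mutating a captured variable only changes each executor's local copy, and the driver never sees the update.

var b = 2
val x = sc.parallelize(0 to 3)
// the closure captures a copy of b; each executor increments its own copy
x.foreach(i => b += i)
println(b)  // still 2 on the driver: the executors' updates are never sent back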

Although not particularly performant, it's possible to implement these requirements in Spark by translating the sequential algorithm into a series of transformations that can be executed in a distributed environment.

Using the same example as in the question, this algorithm could be expressed in Spark as:

val initialOffset = 2
val rdd = sc.parallelize(0 to 3)
// pair each element with every element less than or equal to itself:
// (0,0), (1,0), (1,1), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2), (3,3)
val halfCartesian = rdd.cartesian(rdd).filter{case (x, y) => x >= y}
// sum per key: (0,0), (1,1), (2,3), (3,6) -- the prefix sums without the offset
val partialSums = halfCartesian.reduceByKey(_ + _)
// add the offset to each prefix sum: 2, 3, 5, 8
val adjustedPartials = partialSums.map{case (k, v) => v + initialOffset}
// combine the adjusted prefix sums into the final result
val total = adjustedPartials.reduce(_ + _)

scala> total
res33: Int = 18

Note that cartesian is a very expensive transformation, as it creates m x n elements (n^2 in this case).
This is just to say that it's not impossible, but probably not ideal.
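For completeness, here is a linear-cost sketch (again editorial, not from the original answer) that avoids cartesian entirely: the element at position i appears in (n - i) of the n prefix sums, and the offset is added once per prefix. The names n, weighted, and linearTotal are hypothetical, and this relies on zipWithIndex reflecting the original order, which holds here because parallelize preserves the order of the input range.

val n = rdd.count  // 4 prefixes, so the offset is added 4 times
val weighted = rdd.zipWithIndex.map{case (v, i) => v * (n - i)}  // 0*4, 1*3, 2*2, 3*1
val linearTotal = n * initialOffset + weighted.reduce(_ + _)  // 8 + 10 = 18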

If the amount of data to be processed sequentially fits in the memory of one machine (maybe after filtering/reducing), then Scala has a built-in collection operation that does exactly what's being asked: scan[Left|Right]

val arr = Array(0, 1, 2, 3)
val cumulativeScan = arr.scanLeft(initialOffset)(_ + _)  // Array(2, 2, 3, 5, 8)
// drop the head because scanLeft prepends the initial element to the sequence
val result = cumulativeScan.tail.sum

result: Int = 18

Upvotes: 1
