jtitusj

Reputation: 3084

Spark: Mapping elements of an RDD using other elements from the same RDD

Suppose I have this RDD:

val r = sc.parallelize(Array(1,4,2,3))

What I want to do is create a mapping, e.g.:

r.map(x => x + func(/* all other elements in r */))

Is this even possible?

Upvotes: 0

Views: 618

Answers (3)

Mario

Reputation: 1801

I don't know if there is a more efficient alternative, but I would first create some structure like:

rdd = sc.parallelize([(1, [4, 2, 3]), (4, [1, 2, 3]), (2, [1, 4, 3]), (3, [1, 4, 2])])
rdd = rdd.map(lambda pair: pair[0] + func(pair[1]))  # pair is (element, other elements)
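The pairs above are written out by hand. Here is a rough sketch of how the same structure might be built from the RDD itself, assuming the data is small enough that a cartesian self-join is affordable. The names `others_of` and `pair_with_others` are made up for illustration, and the order of the "other" elements inside each list is not guaranteed by Spark.

```python
def others_of(values, i):
    """Pure-Python reference: every element of `values` except the one at index i."""
    return values[:i] + values[i + 1:]


def pair_with_others(rdd):
    """Build (element, [all other elements]) pairs from an RDD.

    Uses a cartesian self-join, so it produces O(n^2) intermediate pairs.
    This is only a sketch for small RDDs, not a scalable solution.
    """
    indexed = rdd.zipWithIndex()                       # (value, index)
    return (indexed.cartesian(indexed)                 # ((v1, i1), (v2, i2))
            .filter(lambda p: p[0][1] != p[1][1])      # drop an element paired with itself
            .map(lambda p: (p[0], p[1][0]))            # ((v1, i1), v2)
            .groupByKey()                              # ((v1, i1), iterable of the others)
            .map(lambda kv: (kv[0][0], list(kv[1]))))  # (v1, [others])


# Local illustration of the structure being built; no cluster is needed for this part.
data = [1, 4, 2, 3]
expected_pairs = [(x, others_of(data, i)) for i, x in enumerate(data)]
```

With a live SparkContext `sc`, this could then be used as pair_with_others(sc.parallelize([1, 4, 2, 3])).map(lambda pair: pair[0] + func(pair[1])), where `func` is whatever function you want to apply to the other elements.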

Upvotes: 0

marios

Reputation: 8996

Spark already supports gradient descent. It may be worth taking a look at how it is implemented there.

Upvotes: 0

Alberto Bonsanto

Reputation: 18042

You will very likely get an exception if you try this directly, as in the example below.

rdd = sc.parallelize(range(100))
rdd = rdd.map(lambda x: x + sum(rdd.collect()))

That is, you are trying to reference the RDD from inside its own transformation, which Spark rejects:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

To achieve this, compute the aggregate on the driver first, broadcast it, and then use the broadcast value inside the map:

res = sc.broadcast(rdd.reduce(lambda a,b: a + b))
rdd = rdd.map(lambda x: x + res.value)
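Note that the snippet above adds the sum of the whole RDD, including the element itself. If the function really should see only the other elements, and it can be derived from whole-dataset aggregates, one option is to subtract the element's own contribution. Here is a sketch using "mean of the other elements" as an arbitrary example function. The helper names are hypothetical, and it assumes the RDD has at least two elements.

```python
from operator import add


def plus_mean_of_others(x, total, count):
    """Return x plus the mean of all elements except x, given whole-dataset aggregates."""
    return x + (total - x) / (count - 1)


def map_with_mean_of_others(sc, rdd):
    """Spark version: aggregate with actions on the driver, broadcast, then map.

    Assumes the RDD has at least two elements (otherwise count - 1 is zero).
    """
    total = sc.broadcast(rdd.reduce(add))  # action, runs from the driver
    count = sc.broadcast(rdd.count())      # action, runs from the driver
    return rdd.map(lambda x: plus_mean_of_others(x, total.value, count.value))


# Local check of the per-element arithmetic; no cluster is needed for this part.
data = [1, 4, 2, 3]
local = [plus_mean_of_others(x, sum(data), len(data)) for x in data]
```

This only works for functions that can be computed from aggregates such as a sum and a count. For an arbitrary func, the other elements themselves would have to be made available to each task, for example by broadcasting a collected copy of a small RDD.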

Upvotes: 0
