Reputation: 151

Pyspark Array Key,Value

I currently have an RDD that stores an array as key-value pairs, where each key is the 2D index of an element and the value is the number at that position. For example: [((0,0),1),((0,1),2),((1,0),3),((1,1),4)]. For each key, I want to add its value to the values at the surrounding positions. In the example above, I want to add up 1, 2 and 3 and store the sum under the (0,0) key. How would I do this?

Upvotes: 0

Answers (1)

nikkitousen

Reputation: 41

I would suggest you do the following:

Define a function that, given a pair (i,j), returns a list containing the input pair (i,j) plus the pairs for the positions surrounding it. For instance, let's say the function is called surrounding_pairs(pair). Then:
```
surrounding_pairs((0,0)) = [ (0,0), (0,1), (1,0) ]
surrounding_pairs((2,3)) = [ (2,3), (2,2), (2,4), (1,3), (3,3) ]
```
Of course, you need to be careful and return only valid positions.
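As a rough sketch of such a function: the version below assumes a rectangular grid and 4-neighbour adjacency (left, right, up, down), which is what the two examples above imply. The n_rows and n_cols parameters are my own addition for the bounds check; they are not part of the one-argument signature used above.

```python
def surrounding_pairs(pair, n_rows, n_cols):
    """Return (i, j) itself plus its in-bounds left/right/up/down neighbours.

    n_rows and n_cols are assumed grid dimensions, used only to drop
    positions that fall outside the array.
    """
    i, j = pair
    candidates = [(i, j), (i, j - 1), (i, j + 1), (i - 1, j), (i + 1, j)]
    return [(r, c) for r, c in candidates
            if 0 <= r < n_rows and 0 <= c < n_cols]


print(surrounding_pairs((0, 0), 2, 2))  # corner of a 2x2 grid
print(surrounding_pairs((2, 3), 4, 5))  # interior cell of a 4x5 grid
```

To keep the one-argument form, the dimensions could be bound up front, for example with functools.partial(surrounding_pairs, n_rows=2, n_cols=2).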

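Use a flatMap on your RDD as follows (the lambda takes the whole (pos, v) pair as one argument, because Python 3 no longer allows tuple unpacking in lambda parameters):

MyRDD = MyRDD.flatMap(lambda kv: [(p, kv[1]) for p in surrounding_pairs(kv[0])])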

This will map your RDD from [((0,0),1),((0,1),2),((1,0),3),((1,1),4)] to

[((0,0),1),((0,1),1),((1,0),1),
 ((0,1),2),((0,0),2),((1,1),2),
 ((1,0),3),((0,0),3),((1,1),3),
 ((1,1),4),((1,0),4),((0,1),4)]

This way, the value at each position will be "copied" to the neighbour positions.

Finally, just use a reduceByKey to add the corresponding values at each position:
```
from operator import add
MyRDD = MyRDD.reduceByKey(add)
```
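To check the arithmetic without a Spark cluster, the two steps can be mimicked in plain Python. This is only a sketch of the logic, not PySpark code: a list comprehension stands in for flatMap, a dictionary of running sums stands in for reduceByKey(add), and the grid size (2x2) and the n_rows/n_cols parameters are assumptions for the example.

```python
from collections import defaultdict


def surrounding_pairs(pair, n_rows, n_cols):
    # Assumed helper: the cell itself plus in-bounds 4-neighbours.
    i, j = pair
    candidates = [(i, j), (i, j - 1), (i, j + 1), (i - 1, j), (i + 1, j)]
    return [(r, c) for r, c in candidates
            if 0 <= r < n_rows and 0 <= c < n_cols]


data = [((0, 0), 1), ((0, 1), 2), ((1, 0), 3), ((1, 1), 4)]

# Stand-in for flatMap: copy each value to its own position and its neighbours.
copied = [(p, v) for pos, v in data for p in surrounding_pairs(pos, 2, 2)]

# Stand-in for reduceByKey(add): sum all values that landed on the same key.
totals = defaultdict(int)
for p, v in copied:
    totals[p] += v

result = sorted(totals.items())
print(result)  # [((0, 0), 6), ((0, 1), 7), ((1, 0), 8), ((1, 1), 9)]
```

The (0,0) entry comes out as 1 + 2 + 3 = 6, which matches what the question asks for.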

I hope this makes sense.

Upvotes: 0
