Guforu
Guforu

Reputation: 4023

Transform RDD in PySpark

For example I have next RDD of the type ((i,j), k):

((0,0), 0)
((0,1), 0)
((1,0), 0)
((1,1), 0)

I want to transform it to another one, which has 1 if i==j. My first attempt is going wrong:

rddnew = rdd.flatMap(lambda ((i,j), k): [if i==j: ((i,j), 1)]))

Can somebody help me to improve this code in python?

Upvotes: 1

Views: 341

Answers (1)

eliasah
eliasah

Reputation: 40360

Here is a solution :

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0)]
rdd = sc.parallelize(data)
rdd2 = rdd.map(lambda ((i, j), _): ((i, j), 1) if (i == j) else ((i, j), 0))
rdd2.collect()
# [((0, 0), 1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

You can also define a cleaner solution by using a function on the mapper :

def transformation(entry):
    (i, j), v = entry
    return (i, j), v + 1 if i == j else 0

rdd3 = rdd.map(transformation)
rdd3.collect()
# [((0, 0), 1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

Upvotes: 3

Related Questions