Reputation: 19
I have a question. In PySpark, when we need to get a total (SUM) over (key, value) pairs, our query reads like:
RDD1 = RDD.reduceByKey(lambda x, y: x + y)
whereas when we need to find the MAX/MIN value over (key, value) pairs, our query reads like:
RDD1 = RDD.reduceByKey(lambda x, y: x if x[1] >= y[1] else y)
Why do we not use x[1] and y[1] when we sum the data, whereas they are used for MAX/MIN? Please clarify my doubt.
Regards
Upvotes: 1
Views: 936
Reputation: 1712
You're wrong and you've taken this code out of context. In both cases x and y refer to values, not to (key, value) pairs. In the sum example the values are plain numbers, so x + y adds them directly; in the max example the values are tuples, so x[1] and y[1] pick out their second elements.
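Here is a minimal sketch contrasting the two cases (it assumes an existing SparkContext named sc, as in your snippets; the sample data is made up for illustration):

# Sum case: the values are plain numbers, so x + y adds them directly.
sums = sc.parallelize([("k", 1), ("k", 2), ("k", 3)]) \
    .reduceByKey(lambda x, y: x + y)
print(sums.collect())  # [('k', 6)]

# Max case: the values are tuples, so x[1] and y[1] index their second elements.
maxes = sc.parallelize([("k", ("a", 1)), ("k", ("b", 5))]) \
    .reduceByKey(lambda x, y: x if x[1] >= y[1] else y)
print(maxes.collect())  # [('k', ('b', 5))]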
lambda x, y: x if x[1] >= y[1] else y
is equivalent to:
lambda x, y: max(x, y, key=lambda v: v[1])
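To see that comparison in isolation, here is a plain-Python sketch of the same key-based max (no Spark required; the pairs are made up for illustration):

pair_a = ("a", -3)
pair_b = ("b", 3)
# max with a key function compares the pairs by their second element
print(max(pair_a, pair_b, key=lambda v: v[1]))  # ('b', 3)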
It compares values by their second element, which means that each value has to be indexable (implement __getitem__).

Example:
sc.parallelize([(1, ("a", -3)), (1, ("b", 3))]) \
    .reduceByKey(lambda x, y: x if x[1] >= y[1] else y).first()
will be (1, ('b', 3)) because 3 is larger than -3.
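If you want to run the example end to end, a minimal standalone sketch (assuming a local PySpark installation; the app name here is arbitrary) would be:

from pyspark import SparkContext

sc = SparkContext("local", "reduce-by-key-demo")  # app name is arbitrary
result = sc.parallelize([(1, ("a", -3)), (1, ("b", 3))]) \
    .reduceByKey(lambda x, y: x if x[1] >= y[1] else y) \
    .first()
print(result)  # (1, ('b', 3))
sc.stop()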
Upvotes: 1