Vineet Anand

Reputation: 19

Pyspark - Max / Min Parameter

I have a question. In PySpark, when we need to get a total (SUM) over (key, value) pairs, the query reads like:

RDD1 = RDD.reduceByKey(lambda x, y: x + y)

whereas when we need to find the MAX / MIN value for (key, value) pairs, the query reads like:

RDD1 = RDD.reduceByKey(lambda x, y: x if x[1] >= y[1] else y)

Why don't we use x[1] and y[1] when we SUM the data, whereas they are used for MAX / MIN? Please clarify this doubt.

Regards

Upvotes: 1

Views: 936

Answers (1)

user7337271

Reputation: 1712

You're mistaken, and you've taken this code out of context. The function passed to reduceByKey only ever sees values; in both cases x and y refer to values, not (key, value) pairs.

lambda x, y: x if x[1] >= y[1] else y

is equivalent to:

lambda x, y: max(x, y, key=lambda v: v[1])

It compares values by their second element, which means that each value must:

  • Be indexable (implement __getitem__).
  • Have at least two elements.

Example

sc.parallelize([(1, ("a", -3)), (1, ("b", 3))]) \
  .reduceByKey(lambda x , y: x if x[1] >= y[1] else y).first()

will be (1, ('b', 3)) because 3 is larger than -3.
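In contrast, in the SUM case the values are plain numbers, so the reducer adds them directly and there is nothing to index into. A minimal sketch of that case (assuming the same local SparkContext sc as above, with illustrative data):

sc.parallelize([(1, 2), (1, 3), (2, 5)]) \
  .reduceByKey(lambda x, y: x + y) \
  .collect()

returns [(1, 5), (2, 5)] (order may vary). Here x and y are plain integers, not tuples, so x[1] or y[1] would raise a TypeError; that is why the SUM reducer does not index its arguments.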

Upvotes: 1
