Reputation: 831
I have an RDD with three values per record:
rdd = rdd.map(lambda x: (x['Id'],[float(x['value1']),int(x['value2'])]))
I want to find and return the entire record, per key, where value1 is maximised. I know I could do:
rddMax = rdd.map(lambda x: (x['Id'], int(x['value1']))).reduceByKey(max)
and then join it back, but I just want one clean operation that finds the max value grouped by the key and then returns the entire record associated with that value.
I also do not want to put the data in a DataFrame under any circumstances.
thanks
Upvotes: 0
Views: 2459
Reputation:
Try this:
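>>> rdd = rdd.map(lambda x:
...     (x['Id'], (float(x['value1']), int(x['value2']))))
>>> # Per key, keep the (value1, value2) pair with the larger value1.
>>> # Indexing is used because tuple-unpacking lambda parameters
>>> # are a syntax error in Python 3.
>>> rdd.reduceByKey(lambda a, b: a if a[0] > b[0] else b)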
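To check the reducer without a Spark cluster, here is a minimal local sketch. The helper `local_reduce_by_key` and the sample `rows` are made up for illustration and are not part of PySpark; only `keep_larger_value1` is what you would actually hand to `reduceByKey`.

```python
def keep_larger_value1(a, b):
    """Return whichever (value1, value2) pair has the larger value1."""
    return a if a[0] > b[0] else b


def local_reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey (illustration only)."""
    result = {}
    for key, value in pairs:
        # First value seen for a key is stored as-is; later ones are merged.
        result[key] = func(result[key], value) if key in result else value
    return result


# Made-up sample records shaped like the ones in the question.
rows = [
    {"Id": "a", "value1": "1.5", "value2": "10"},
    {"Id": "a", "value1": "3.0", "value2": "20"},
    {"Id": "b", "value1": "2.0", "value2": "7"},
    {"Id": "b", "value1": "0.5", "value2": "99"},
]

# Same shape as the mapped RDD: (Id, (value1, value2)).
pairs = [(r["Id"], (float(r["value1"]), int(r["value2"]))) for r in rows]

best = local_reduce_by_key(pairs, keep_larger_value1)
print(best)
```

With a real RDD the same function should work as `rdd.reduceByKey(keep_larger_value1)`. One caveat: if two records for a key tie on value1, which value2 survives may depend on the order Spark merges them in, since the reducer is not strictly commutative on ties.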
Upvotes: 3