theMadKing

Reputation: 2074

pySpark forEach function on a key

I can't seem to find much documentation on forEach. I have a dataset of key/value pairs. I am looking to do something like (pseudo code):

for each key, sum the values
for each key, take the max of the values
etc.

Upvotes: 0

Views: 5797

Answers (2)

Lan

Reputation: 6660

Please take a look at the Spark Programming Guide:

foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems.

Please note the highlighted "side effects". foreach is an action on an RDD that performs a function on each element of the RDD but does not return anything to the driver. You can pass it a function that, for example, prints elements, increments an accumulator variable, or saves to external storage.
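As a minimal sketch of that side-effect pattern (assuming a live SparkContext bound to sc, as in the other answer), this sums every value into an accumulator. Note that it yields a single total across the whole RDD, not a per-key result:

rdd = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])

total = sc.accumulator(0)  # driver-visible counter, updated from the workers

def add_value(pair):
    # Pure side effect: foreach returns nothing to the driver
    total.add(pair[1])

rdd.foreach(add_value)
print(total.value)  # 6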

In your use case, you should use reduceByKey.

Upvotes: 2

dpeacock

Reputation: 2747

This can be done with reduceByKey, e.g.:

rdd = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])

rdd.reduceByKey(lambda x, y: x + y).collect()  # Sum for each key
# Gives [('foo', 3), ('bar', 3)]

rdd.reduceByKey(max).collect()  # Max for each key
# Gives [('foo', 2), ('bar', 3)]

Upvotes: 2
