theMadKing

Reputation: 2074

pySpark forEach function on a key

I can't seem to find much documentation on forEach. I have a dataset of key/value pairs. I am looking to do something like (pseudo code):

for each key, sum the values
for each key, take the max of the values
etc.

Upvotes: 0

Views: 5797

Answers (2)

Lan

Reputation: 6660

Please take a look at the Spark Programming Guide:

foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems.

Please note the highlighted "side effects". foreach is an action on an RDD that performs a function on each element of the RDD but does not return anything to the driver. You can pass it a function that, for example, prints elements, increments an accumulator variable, or saves to external storage.
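As a minimal sketch of that side-effect pattern (assuming a live SparkContext bound to sc, as in the other answer), this sums every value into an accumulator. Note that it yields a single total across the whole RDD, not a per-key result:

rdd = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])

total = sc.accumulator(0)  # driver-visible counter, updated from the workers

def add_value(pair):
    # Pure side effect: foreach returns nothing to the driver
    total.add(pair[1])

rdd.foreach(add_value)
print(total.value)  # 6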

In your use case, you should use reduceByKey.

Upvotes: 2

dpeacock

Reputation: 2747

This can be done with reduceByKey, e.g.:

rdd = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])

rdd.reduceByKey(lambda x, y: x + y).collect()  # Sum for each key
# Gives [('foo', 3), ('bar', 3)]

rdd.reduceByKey(max).collect()  # Max for each key
# Gives [('foo', 2), ('bar', 3)]

Upvotes: 2
