Reputation: 2074
I can't seem to find much documentation on forEach. I have a dataset of key/value pairs. I am looking to do something like this (pseudo code):
forEach key, sum the values
forEach key, max of the values
etc.
Upvotes: 0
Views: 5797
Reputation: 6660
Please take a look at the Spark Programming Guide:
foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems.
Please note the "side effects" part: foreach is an action on an RDD that applies a function to each element but does not return anything to the driver. You can pass it a function such as println, or one that increments an accumulator variable or saves to external storage.
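For example, a minimal PySpark sketch (assuming the pyspark shell's built-in SparkContext sc) that sums the values as a side effect through an accumulator:

rdd = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])
total = sc.accumulator(0)                  # write-only shared variable, read back on the driver
rdd.foreach(lambda kv: total.add(kv[1]))   # side effect only; foreach returns nothing
print(total.value)                         # 6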
In your use case, you should use reduceByKey.
Upvotes: 2
Reputation: 2747
This can be done with reduceByKey, for example:
rdd = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 3)])  # sc is the SparkContext, e.g. from the pyspark shell
rdd.reduceByKey(lambda x, y : x + y).collect() # Sum for each key
# Gives [('foo', 3), ('bar', 3)]
rdd.reduceByKey(max).collect() # Max for each key
# Gives [('foo', 2), ('bar', 3)]
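If you need both the sum and the max in a single pass, one way is to pair each value with itself and reduce over (sum, max) tuples:

rdd.mapValues(lambda v: (v, v)) \
   .reduceByKey(lambda a, b: (a[0] + b[0], max(a[1], b[1]))).collect()
# Gives [('foo', (3, 2)), ('bar', (3, 3))]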
Upvotes: 2