Reputation: 487
I am learning Spark in Python and wondering can anyone explain the difference between the action foreach()
and transformation map()
?
rdd.map()
returns a new RDD, like the original map function in Python. However, I want to see a rdd.foreach()
function and understand the differences. Thanks!
Upvotes: 11
Views: 29661
Reputation: 1450
TL;DR
foreach()
is an Action while map()
is a Transformationforeach()
is used for side-effect operations while map()
i used for non-side-effect operationsforeach()
returns None while map()
returns RDDLong version
To understand foreach
we first need to understand the side-effect operations. These operations are the processes that changes the state of the system such as
foreach
is used in operations which are side-effect operations. They have NoneType as return.
For example:
acc = sc.accumulator(0)
def add_to_accumulator(x):
global acc
acc += x
sc.parallelize(range(5)).foreach(add_to_accumulator)
print(acc)
>> 10
On the other hand map
is used for element wise mapping and doesn't have NoneType as return.
sc.parallelize(range(5)).map(lambda x: x**2)
>> [0, 1, 4, 9 , 16, 25]
Upvotes: 0
Reputation: 1961
Map is a transformation, thus when you perform a map you apply a function to each element in the RDD and return a new RDD where additional transformations or actions can be called.
Foreach is an action, it takes each element and applies a function, but it does not return a value. This is particularly useful in you have to call perform some calculation on an RDD and log the result somewhere else, for example a database or call a REST API with each element in the RDD.
For example let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.
queries = <code to load queries or a transformation that was applied on other RDDs>
Then you want to save those queries in another system via a call to another API
import urllib2
def log_search(q):
response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)
queries.foreach(log_search)
Now you have executed the log_query on each element of the RDD. If you have done a map, nothing would have happened yet, until you called an action.
Upvotes: 5
Reputation: 9963
A very simple example would be rdd.foreach(print)
which would print the value of each row in the RDD but not modify the RDD in any way.
For example, this produces an RDD with the numbers 1 - 10:
>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map
call computed a new value for each row and it returned it so that I get a new RDD. However, if I used foreach
that would be useless because foreach
doesn't modify the rdd in any way:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map
on a function that returns None
like print
isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print
call returns None
so mapping that just gives you a bunch of None
values and you didn't want those values and you didn't want to save them so returning them is a waste. (Note the lines with 1
, 2
, etc. are the print
being executed and they don't show up until you call take
since the RDD is executed lazily. However the contents of the RDD are just a bunch of None
.
More simply, call map
if you care about the return value of the function. Call foreach
if you don't.
Upvotes: 19