Chenxi Zeng
Chenxi Zeng

Reputation: 487

Difference between RDD.foreach() and RDD.map()

I am learning Spark in Python and wondering can anyone explain the difference between the action foreach() and transformation map()?

rdd.map() returns a new RDD, like the original map function in Python. However, I want to see a rdd.foreach() function and understand the differences. Thanks!

Upvotes: 11

Views: 29661

Answers (3)

Lawhatre
Lawhatre

Reputation: 1450

TL;DR

  1. foreach() is an Action while map() is a Transformation
  2. foreach() is used for side-effect operations while map() i used for non-side-effect operations
  3. foreach() returns None while map() returns RDD

Long version

To understand foreach we first need to understand the side-effect operations. These operations are the processes that changes the state of the system such as

  1. writing the data to external file or database
  2. update a variable

foreach is used in operations which are side-effect operations. They have NoneType as return.

For example:

acc = sc.accumulator(0)

def add_to_accumulator(x): 
    global acc
    acc += x

sc.parallelize(range(5)).foreach(add_to_accumulator)

print(acc)

>> 10

On the other hand map is used for element wise mapping and doesn't have NoneType as return.

sc.parallelize(range(5)).map(lambda x: x**2)

>> [0, 1, 4, 9 , 16, 25]

Upvotes: 0

xmorera
xmorera

Reputation: 1961

Map is a transformation, thus when you perform a map you apply a function to each element in the RDD and return a new RDD where additional transformations or actions can be called.

Foreach is an action, it takes each element and applies a function, but it does not return a value. This is particularly useful in you have to call perform some calculation on an RDD and log the result somewhere else, for example a database or call a REST API with each element in the RDD.

For example let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.

queries = <code to load queries or a transformation that was applied on other RDDs>

Then you want to save those queries in another system via a call to another API

import urllib2

def log_search(q):
    response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)

queries.foreach(log_search)

Now you have executed the log_query on each element of the RDD. If you have done a map, nothing would have happened yet, until you called an action.

Upvotes: 5

Oliver Dain
Oliver Dain

Reputation: 9963

A very simple example would be rdd.foreach(print) which would print the value of each row in the RDD but not modify the RDD in any way.

For example, this produces an RDD with the numbers 1 - 10:

>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

The map call computed a new value for each row and it returned it so that I get a new RDD. However, if I used foreach that would be useless because foreach doesn't modify the rdd in any way:

>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>

Conversely, calling map on a function that returns None like print isn't very useful:

>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]

The print call returns None so mapping that just gives you a bunch of None values and you didn't want those values and you didn't want to save them so returning them is a waste. (Note the lines with 1, 2, etc. are the print being executed and they don't show up until you call take since the RDD is executed lazily. However the contents of the RDD are just a bunch of None.

More simply, call map if you care about the return value of the function. Call foreach if you don't.

Upvotes: 19

Related Questions