Steven

Reputation: 15318

RDD foreach method provides no results

I am trying to understand how the foreach method works. In my Jupyter notebook, I tried:

def f(x): print(x)
a = sc.parallelize([1, 2, 3, 4, 5])
b = a.foreach(f)
print(type(b))
<class 'NoneType'>

This runs without any problem, but I get no output other than the print(type(b)) line. foreach doesn't return anything, just None. I do not know what foreach is supposed to do or how to use it. Can you explain what it is used for?

Upvotes: 1

Views: 1736

Answers (2)

amitabh Srivastava

Reputation: 21

I use the following approach, and it works well in a Jupyter notebook with PySpark:

for row in RDD.toLocalIterator():
    print(row)

This turns your RDD into an iterator that yields its elements to the driver, so you can loop over them with a plain for loop. Alternatively, create the iterator first and then use it in your loop:

genobj = data.toLocalIterator()
for row in genobj:
    print(row)

Upvotes: 2

desertnaut

Reputation: 60390

foreach is an action and does not return anything, so assigning its result to a variable, as in b = a.foreach(f), only gives you None. It exists to run a function on each element for its side effects. See Learning Spark, pp. 41-42 (the excerpts were originally posted here as images).

Adapting the simple example from the docs, run in a PySpark terminal:

>>> def f(x): print(x)
>>> a = sc.parallelize([1, 2, 3, 4, 5])
>>> a.foreach(f)
5
4
3
1
2

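(NOTE: the above code will not produce any printed output in a Databricks notebook, and the question suggests the same happens in Jupyter. The likely reason is that f runs in the worker processes, so print writes to their stdout rather than to the notebook cell.)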

You may also find the answers in this thread helpful.

Upvotes: 4
