Reputation: 1932
I'm new to PySpark and I want to play with a couple of functions to better understand how I could use them in more realistic scenarios. For a while I have been trying to apply a specific function to each number in an RDD. My problem is that when I try to print what I got back from my RDD, the result is None.
My code:
from pyspark import SparkConf , SparkContext
conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
changed = []
def div_two(n):
    opera = n / 2
    return opera
numbers = [8,40,20,30,60,90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
#result = numbersRDD.map(lambda x: div_two(x))
for i in changed:
    print(i)
I would appreciate a clear explanation of why None ends up in the list, and what the right approach would be to achieve this using foreach, if that is possible.
Thanks
Upvotes: 6
Views: 24091
Reputation: 41957
Your definition of div_two is fine, although it can be reduced to
def div_two(n):
    return n / 2
You have also converted the list of integers to an RDD, which is fine too.
The main issue is that you are appending the return value of foreach to the list changed. If you look at the definition of foreach
def foreach(self, f) Inferred type: (self: RDD, f: Any) -> None
you can see that its return type is None. The values returned by div_two are discarded, so a single None is appended to changed, and that is what gets printed.
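To see why the list ends up holding None, here is a plain-Python sketch that imitates the two methods. TinyRDD is a hypothetical stand-in written only for this illustration and is not part of PySpark. It mirrors one fact: foreach runs a function for its side effects and returns nothing, while map returns a new collection. Unlike the real map, this one is eager rather than lazy, to keep the sketch short.

```python
class TinyRDD:
    """Hypothetical stand-in for an RDD (not a PySpark class)."""

    def __init__(self, items):
        self.items = list(items)

    def foreach(self, f):
        # Run f on every element; whatever f returns is thrown away.
        for item in self.items:
            f(item)
        # No return statement, so the call evaluates to None.

    def map(self, f):
        # Build a new collection from the values f returns.
        return TinyRDD(f(item) for item in self.items)


def div_two(n):
    return n / 2


numbers = TinyRDD([8, 40, 20, 30, 60, 90])

changed = []
changed.append(numbers.foreach(div_two))  # appends the None returned by foreach
mapped = numbers.map(div_two)             # keeps the computed values
```

Here changed holds a single None, while mapped.items holds the halved numbers.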
You don't need a list to print the changed elements of an RDD. You can simply write a function that prints and pass it to foreach
def printing(x):
    print(x)
numbersRDD.map(div_two).foreach(printing)
You should see the results printed. This holds when running in local mode; on a cluster, print runs on the executors, so the output goes to the executor logs rather than the driver console.
You can still add the RDD to a list, but an RDD is itself a distributed collection and a list is a collection too. So if you add an RDD to a list you will have a collection of collections, which means you need two loops
changed.append(numbersRDD.map(div_two))
def printing(x):
    print(x)
for i in changed:
    i.foreach(printing)
The main difference between your code and mine is that I used map (a transformation, which returns a new RDD) instead of foreach (an action, which returns None) when adding to the changed list. I also used two loops to print the elements: a Python for loop over the list and foreach over each RDD inside it.
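If the goal is to have the halved numbers in an ordinary Python list on the driver, one common approach is map followed by collect. The following is a minimal sketch under that assumption. The helper names halve_on_driver and main are made up for this illustration, local[*] is an assumed master setting, and collect is only sensible when the result fits in driver memory.

```python
def div_two(n):
    return n / 2


def halve_on_driver(sc, numbers):
    """Return the halved numbers as a local list.

    sc is expected to be a pyspark SparkContext.
    """
    rdd = sc.parallelize(numbers)
    # map is a transformation (returns a new RDD);
    # collect is an action that returns a Python list to the driver.
    return rdd.map(div_two).collect()


def main():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark import SparkConf, SparkContext

    # Assumption: run locally using all available cores.
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("WARN")
    try:
        changed = halve_on_driver(sc, [8, 40, 20, 30, 60, 90])
        for i in changed:
            print(i)
    finally:
        sc.stop()
```

Calling main() prints the values on the driver, because the loop runs over a local list rather than inside the executors.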
Upvotes: 6