RDD creation and variable binding

Question

I have a very simple code:

def fun(x, n):
    return (x, n)

rdds = []
for i in range(2):
    rdd = sc.parallelize(range(5*i, 5*(i+1)))
    rdd = rdd.map(lambda x: fun(x, i))
    rdds.append(rdd)

a = sc.union(rdds)
print a.collect()

I had expected the output to be the following:

[(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

However, the output is the following:

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

This is bewildering, to say the least.

It seems, due to lazy evaluation of RDDs, the value of i that is being used to create RDDs is the one it bears when collect() is called, which is 1 (from the last run of the for loop).

Now, both elements of the tuple are derived from i.

But it seems, for the first element of the tuple, i bears values 0 and 1 while for the second element of the tuple i bears the value 2.

Can somebody please explain what's happening?

Thanks.

Zhang Tong · Accepted Answer

just change

rdd = rdd.map(lambda x: fun(x, i))

to

rdd = rdd.map(lambda x, i=i: (x, i))

That is only about Python, look at this

https://docs.python.org/2.7/tutorial/controlflow.html#default-argument-values

RDD creation and variable binding

Answers (2)

Related Questions