Reputation: 35
I ran into a problem while using Spark with Python 3 in my project. In a key-value pair like ('1', '+1 2,3'), the part '2,3' is the content I want to process. So I wrote the following code (assume this key-value pair is saved in an RDD called p_list):
def add_label(x):
    label = x[1].split()[0]             # e.g. '+1'
    value = x[1].split()[1].split(",")  # e.g. ['2', '3']
    for i in value:
        return (i, label)

p_list = p_list.map(add_label)
After running this, I could only get the result ('2', '+1'), but it should be both ('2', '+1') and ('3', '+1'). It seems the "for" loop inside the map operation only ran once. How can I make it run multiple times? Or is there another way to implement something like a "for" loop inside a map or reduce operation?
I want to mention that what I am really dealing with is a large dataset, so I have to use an AWS cluster and implement the loop with parallelization. The slave nodes in the cluster do not seem to understand the loop. How can I express this with a Spark RDD function, or achieve such a loop operation in a pipelined way (which is one of the main design goals of Spark RDDs)?
Upvotes: 3
Views: 56539
Reputation: 5151
Your return statement cannot be inside the loop; otherwise, it returns after the first iteration, never to make it to the second iteration.
What you could try is this:

result = []
for i in value:
    result.append((i, label))
return result
and then result would be a list of all of the tuples created inside the loop.
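Note, though, that with map each record of the resulting RDD would itself be a whole list of tuples. A minimal sketch of one way to flatten that, assuming the same p_list layout as in the question, is to use flatMap, which emits each element of the returned list as its own record:

from pyspark import SparkContext

sc = SparkContext(appName="add_label_example")

# Sample pair in the layout described in the question.
p_list = sc.parallelize([('1', '+1 2,3')])

def add_label(x):
    label = x[1].split()[0]              # '+1'
    values = x[1].split()[1].split(",")  # ['2', '3']
    # Build the full list instead of returning inside the loop.
    return [(v, label) for v in values]

# flatMap flattens each returned list into individual records.
print(p_list.flatMap(add_label).collect())  # [('2', '+1'), ('3', '+1')]

With flatMap the loop still runs inside each task, so it parallelizes across the cluster the same way map does.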
Upvotes: 2