Linghao

Reputation: 35

How can I use a "for" loop in Spark with PySpark?

I ran into a problem while using Spark with Python 3 in my project. In a key-value pair like ('1', '+1 2,3'), the part "2,3" is the content I want to check, so I wrote the following code (assume this key-value pair is saved in an RDD called p_list):


def add_label(x):
    # x is a pair like ('1', '+1 2,3'): the label is '+1', the values are '2,3'
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    for i in value:
        return (i, label)
p_list = p_list.map(add_label)

After doing that, I only got the result ('2','+1'), but it should be ('2','+1') and ('3','+1'). It seems the "for" loop inside the map operation only ran once. How can I make it run multiple times? Or is there another way to implement something like a "for" loop inside a map or reduce operation?

I should mention that what I am really dealing with is a large dataset, so I have to use an AWS cluster and run the loop in parallel. The worker nodes in the cluster do not seem to execute the loop the way I expect. How can I express this with Spark RDD functions? Or how can I perform such a loop in a pipelined way (which is one of the main design goals of Spark RDDs)?

Upvotes: 3

Views: 56539

Answers (1)

Matt Cremeens

Reputation: 5151

Your return statement cannot be inside the loop; otherwise, the function returns after the first iteration and never makes it to the second.
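As a small illustration in plain Python (a hypothetical list, not the exact data from the question), a return inside a loop ends the function on the first pass:

def first_only(values):
    for v in values:
        return v                   # the function exits here on the first iteration

print(first_only(['2', '3']))      # prints '2'; '3' is never reached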

What you could try is this

result = []
for i in value:
    result.append((i, label))   # collect one tuple per element instead of returning early
return result

and then result would be a list of all of the tuples created inside the loop.
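For completeness, here is a minimal sketch of how the fixed function might look end to end. The sample RDD contents are made up to match the shape of the question's data, and I'm assuming you want each tuple to become its own record, as in the expected output ('2','+1') and ('3','+1'); in that case flatMap (rather than map) flattens the returned lists into individual records.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical two-record RDD shaped like the question's data
p_list = sc.parallelize([('1', '+1 2,3'), ('2', '-1 4,5')])

def add_label(x):
    label = x[1].split()[0]                  # e.g. '+1'
    values = x[1].split()[1].split(",")      # e.g. ['2', '3']
    result = []
    for i in values:
        result.append((i, label))
    return result                            # return outside the loop

# flatMap flattens each returned list, yielding
# ('2', '+1'), ('3', '+1'), ('4', '-1'), ('5', '-1')
labeled = p_list.flatMap(add_label)
print(labeled.collect())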

Upvotes: 2
