Ruofan Kong

Reputation: 1070

Problems with Spark when dealing with a list of Python objects

I am learning Spark, and I just ran into a problem when using Spark to deal with a list of Python objects. The following is my code:

import numpy as np    
from pyspark import SparkConf, SparkContext

### Definition of Class A
class A:
    def __init__(self, n):
        self.num = n

### Function "display"
def display(s):
    print(s.num)
    return s

def main():
    ### Initialize the Spark
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf=conf)

    ### Create a list of instances of Class A
    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)

    ### Use Spark to parallelize the list of instances
    lines = sc.parallelize(data)

    ### Spark mapping
    lineLengths1 = lines.map(display)

if __name__ == "__main__":
    main()

When I run my code, it does not seem to print the number of each instance (but it should print 0, 1, 2, 3, 4). I have tried to find the reason, but I have no idea what is wrong. I would really appreciate it if anyone could help me.

Upvotes: 0

Views: 120

Answers (1)

zero323

Reputation: 330283

First of all, display is never executed. RDDs are lazily evaluated, so as long as you don't perform an action (like collect, count or saveAsTextFile) nothing really happens.
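For example, a minimal sketch reusing the lines RDD from your code (count here is just one possible action; collect would work as well):

lineLengths1 = lines.map(display)   ### A transformation: nothing runs yet

### An action forces evaluation, so display is now actually called
### once per element:
lineLengths1.count()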

Another part of the problem requires an understanding of Spark architecture. Simplifying things a little bit, the driver program is responsible for creating the SparkContext and sending tasks to the worker nodes. Everything that happens inside transformations (in your case map) is executed on the workers, so the output of print goes to the worker's stdout. If you want to obtain some kind of output, you should consider using logs instead.
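If you just want to see the values on the driver, a minimal sketch (again assuming the same lines RDD) is to collect the mapped results first; collect brings them back to the driver process, where the printed output is actually visible:

### Map to plain numbers and ship them back to the driver:
nums = lines.map(lambda s: s.num).collect()

### This print runs on the driver, so you see the output in your console:
for n in nums:
    print(n)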

Finally, if your goal is to get some kind of side effect, it would be idiomatic to use foreach instead of map.
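A minimal sketch with foreach; note that display's return value is simply discarded, and with a non-local master the printed output still ends up in the worker logs:

### foreach is an action intended purely for side effects; it returns None:
lines.foreach(display)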

Upvotes: 1
