Reputation: 1070
I am learning Spark, and I ran into a problem when I used Spark to process a list of Python objects. The following is my code:
import numpy as np
from pyspark import SparkConf, SparkContext
### Definition of Class A
class A:
    def __init__(self, n):
        self.num = n

### Function "display"
def display(s):
    print s.num
    return s

def main():
    ### Initialize Spark
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf=conf)

    ### Create a list of instances of Class A
    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)

    ### Use Spark to parallelize the list of instances
    lines = sc.parallelize(data)

    ### Spark mapping
    lineLengths1 = lines.map(display)

if __name__ == "__main__":
    main()
When I run my code, it does not seem to print the number of each instance (it should have printed 0, 1, 2, 3, 4). I have tried to find the reason, but I have no idea what is wrong. I would really appreciate it if anyone could help me.
Upvotes: 0
Views: 120
Reputation: 330283
First of all, display is never executed. RDDs are lazily evaluated, so as long as you don't perform an action (like collect, count or saveAsTextFile) nothing really happens.
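For example, appending an action such as count to the mapped RDD forces the evaluation; a minimal sketch reusing the names from your code:

lineLengths1 = lines.map(display)
### count is an action, so it triggers evaluation and display runs for every element
lineLengths1.count()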
Another part of the problem requires an understanding of the Spark architecture. Simplifying things a little bit, the driver program is responsible for creating the SparkContext and sending tasks to the worker nodes. Everything that happens inside transformations (in your case map) is executed on the workers, so the output of the print statement goes to the workers' stdout. If you want to obtain some kind of output, you should consider using logs instead.
Finally, if your goal is to produce some kind of side effect, it would be idiomatic to use foreach instead of map.
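For instance, something like the following expresses that intent more clearly, although the printed output still ends up in the workers' stdout:

### foreach is meant for side effects and does not build a new RDD
lines.foreach(display)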
Upvotes: 1