Will Hardman

Reputation: 193

Spark Streaming: cannot collect() a DStream when running in cluster mode

I have a strange issue when running code via the PySpark bindings to Spark 1.3.1.

Consider the code below:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[1]")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

myStream = ssc.socketTextStream("localhost", 4663)

def f(rdd):
    # collect() each batch's RDD back to the driver and print its rows
    rows = rdd.collect()
    for r in rows:
        print r

myStream.foreachRDD(f)

ssc.start()
ssc.awaitTermination()

Now if I run the code above and connect via nc -lk 4663, the text I type is printed out on the console of the machine running Spark. Great.
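(For reference, a sketch of how I'm testing this; app.py is just a placeholder name for the script above:)

# terminal 1: start a listener for the receiver to connect to
nc -lk 4663

# terminal 2: submit the script
spark-submit app.py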

However, if I change the first line of the code to sc = SparkContext() (which will cause it to launch in cluster mode, with the driver running on the local machine), my text does not get printed to the console, though I can see messages like

INFO BlockManagerMaster: Updated info of block input-0-1447438549400

being printed to the console, so I know it is still picking up the text that is coming in over the TCP port.

This is odd, because the collect() action should force the RDDs in the DStream to be returned to the driver, so I think I should see the text.

Can anyone help me here? What am I doing wrong?

With many thanks,

Will

Upvotes: 1

Views: 792

Answers (1)

Marius Soutier

Reputation: 11284

If by cluster mode you mean submitting your code using --deploy-mode cluster, the driver is not running on the master machine, but on one of the workers. Anything the driver prints then goes to the driver's stdout log on that worker, not to the console you submitted from.

Check the documentation for more details.
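If you want the printed output to show up on your local console, one option (a sketch; app.py and the master URL are placeholders) is to submit in client mode, which keeps the driver on the machine you submit from:

spark-submit --master spark://your-master:7077 --deploy-mode client app.py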

Upvotes: 0
