Will Hardman

Reputation: 193

Spark Streaming: cannot collect() a DStream when running in cluster mode

I have a strange issue when running code via the PySpark bindings to Spark 1.3.1.

Consider the code below:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[1]")
ssc = StreamingContext(sc, 10)  # 10-second batch interval

myStream = ssc.socketTextStream("localhost", 4663)

def f(rdd):
    # collect() each batch's RDD back to the driver and print its rows
    rows = rdd.collect()
    for r in rows:
        print r

myStream.foreachRDD(f)

ssc.start()
ssc.awaitTermination()

Now if I run the code above and connect via nc -lk 4663, the text I type is printed out on the console of the machine running Spark. Great.
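(For reference, a sketch of how I'm testing this; app.py is just a placeholder name for the script above:)

# terminal 1: start a listener for the receiver to connect to
nc -lk 4663

# terminal 2: submit the script
spark-submit app.py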

However, if I change the first line of the code to sc = SparkContext() (which will cause it to launch in cluster mode, with the driver running on the local machine), my text does not get printed to the console, though I can see messages like

INFO BlockManagerMaster: Updated info of block input-0-1447438549400

being printed to the console, so I know it is still picking up the text that is coming in over the TCP port.

This is odd, because the collect() action should force the RDDs in the DStream to be returned to the driver, so I think I should see the text.

Can anyone help me here? What am I doing wrong?

With many thanks,

Will

Upvotes: 1

Views: 792

Answers (1)

Marius Soutier

Reputation: 11284

If by cluster mode you mean submitting your code using --deploy-mode cluster, the driver is not running on the master machine, but on one of the workers. Anything the driver prints then goes to the driver's stdout log on that worker, not to the console you submitted from.

Check the documentation for more details.
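If you want the printed output to show up on your local console, one option (a sketch; app.py and the master URL are placeholders) is to submit in client mode, which keeps the driver on the machine you submit from:

spark-submit --master spark://your-master:7077 --deploy-mode client app.py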

Upvotes: 0
