Reputation: 145
I have a very simple Spark Streaming job running locally in standalone mode. There is a custom receiver which reads from a database and passes the data to the main job, which prints the total count. It's not an actual use case; I am just playing around to learn. The problem is that the job gets stuck forever. The logic is very simple, so I don't think it is a processing or memory issue. What is strange is that if I STOP the job, I suddenly see the output of the job execution in the logs, and the other backed-up jobs follow! Can someone help me understand what is going on here?
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession
  .builder()
  .master("local[1]")
  .appName("SocketStream")
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Custom receiver that reads from the database
val lines = ssc.receiverStream(new HanaCustomReceiver())
lines.foreachRDD { rdd => println("==============" + rdd.count()) }

ssc.start()
ssc.awaitTermination()
After terminating the program, the following logs roll, which show the execution of the batch:
17/06/05 15:56:16 INFO JobGenerator: Stopping JobGenerator immediately
17/06/05 15:56:16 INFO RecurringTimer: Stopped timer for JobGenerator after time 1496696175000
17/06/05 15:56:16 INFO JobGenerator: Stopped JobGenerator
==============100
Upvotes: 1
Views: 651
Reputation: 74679
TL;DR Use local[2] at the very least.
The issue is the following line:
.master("local[1]")
Spark Streaming with receivers needs at least 2 threads: one is permanently taken by the receiver, so with a single thread your streaming jobs never even get a chance to start. They sit waiting for resources, i.e. a free thread to be allocated.
Quoting Spark Streaming's A Quick Example:
// The master requires 2 cores to prevent from a starvation scenario.
My recommendation is to use local[2] (the minimum) or local[*] to take as many cores as possible. The best solution is to use a cluster manager like Apache Mesos, Hadoop YARN, or Spark Standalone.
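Applied to the SparkSession builder from the question, the change is a single line; the rest of the job stays exactly as posted:

val spark = SparkSession
  .builder()
  .master("local[2]")  // one thread for the receiver, at least one for processing batches
  .appName("SocketStream")
  .getOrCreate()

With a second thread available, the receiver no longer starves the batch jobs, and the counts print every 5 seconds instead of only after shutdown.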
Upvotes: 2