Prashant_J

Reputation: 354

Execution order of code in Spark Streaming

In Spark Streaming, the line val lines = ssc.socketTextStream("localhost", 1234) creates a DStream (a collection of RDDs). But since code is always executed sequentially, line by line, I am confused about how this single line can keep generating the DStream. And when an operation like val words = lines.print() is then applied, how does that line print all the data present in a given RDD, since val lines = ssc.socketTextStream("localhost", 1234) is already running? Is some infinite loop being performed in this process? Please explain the flow of the Spark Streaming process. Thanks

Upvotes: 0

Views: 1133

Answers (1)

T. Gawęda

Reputation: 16096

Spark's transformations are lazy. That means nothing is executed until an action is called.

What does this mean for you?

  1. val lines = ssc.socketTextStream("localhost", 1234) creates one DStream. Nothing is read at this point; the DStream is only created with the configuration describing what source the data will be read from in the future.

  2. With val words = lines.print() you declare that every RDD created in point 1 will be printed. The words variable is now composed of two declarations: the source of the data (point 1) and the first action that will be invoked (print()). A complete sketch of this wiring follows the list.

  3. Execution starts with ssc.start(). Spark then creates all the necessary background machinery, etc. - nothing you need to know about right now :) The important thing is that the Spark context knows about every RDD created with that context. This StreamingContext has the information about the words DStream, and after ssc.start() Spark will read data from the source and pass it into the print() function.
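
Putting the three points together, here is a minimal, self-contained sketch of such a program. It assumes the classic DStream API running on a local master; the object name, app name, and batch interval are illustrative choices, not anything from your question:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingFlowDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingFlowDemo")
        val ssc  = new StreamingContext(conf, Seconds(1)) // one RDD per 1-second batch

        // Point 1: only declares the source - nothing is read yet
        val lines = ssc.socketTextStream("localhost", 1234)

        // Point 2: only declares the output operation - nothing is printed yet
        lines.print()

        // Point 3: the loop actually starts here; from now on Spark builds
        // one RDD per batch interval and runs print() on each of them
        ssc.start()
        ssc.awaitTermination() // block the driver while the streaming loop runs
      }
    }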

In other words, you are not invoking the actions/transformations as you type them; you are only declaring that you want them to be invoked. It is very similar to an execution plan in a SQL engine. The Spark context executes the actions and transformations you asked for after start() is called and after each RDD of the DStream has been created by Spark. The DStreams are produced by Spark in a loop, but that loop is transparent to you. The sketch below shows this laziness directly.
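
As an illustration (reusing the hypothetical lines DStream from the sketch above), the println inside the transformation below runs once per received line, and only after ssc.start(); nothing is printed when these two declarations are first executed:

    // Declaring a transformation: the body runs per element, per batch,
    // and only once the streaming loop has been started with ssc.start()
    val upper = lines.map { line =>
      println(s"processing: $line") // side effect shows when map actually runs
      line.toUpperCase
    }
    upper.print() // declares the output; still nothing happens before start()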

Upvotes: 1
