Dr.Pro

Reputation: 223

Processed batch vs RDD in Spark Streaming

I saw several answers (e.g. here) on SO suggesting that the records in a batch become a single RDD. I doubt this because, if the batchInterval is 1 minute, would a single RDD then contain all the data from the last minute?

NOTE: I'm not directly comparing a batch to an RDD, but rather asking how a batch is processed by Spark internally.

Upvotes: 0

Views: 302

Answers (1)

user7922234

Reputation: 11

Let me quote the Spark Streaming guide:

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the following figure.

[Figure: a DStream shown as a continuous series of RDDs, one per batch interval]

As you can see, a single batch = a single RDD. This is why adjusting the batch interval to your data flow is crucial for the stability of your application.
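As a minimal sketch of how this looks in practice (assuming a socketTextStream source on localhost:9999, e.g. fed by `nc -lk 9999`; the object name and interval are illustrative), foreachRDD fires once per batch interval and hands you exactly one RDD holding that interval's records:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIsOneRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchIsOneRdd").setMaster("local[2]")

    // Batch interval of 60 seconds: every minute of received records
    // is packaged into one RDD of the DStream.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Hypothetical input source for the example.
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD is invoked once per batch interval; the rdd argument
    // holds all records received during that interval.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time is one RDD with ${rdd.count()} records " +
        s"spread over ${rdd.getNumPartitions} partitions")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that "one RDD per batch" does not mean one partition: the RDD is still distributed across the cluster, so a large batch interval mainly affects latency and how much data each micro-batch job has to process.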

Upvotes: 1
