Reputation: 223
I have seen several answers on SO (e.g. here) suggesting that the records in a batch become a single RDD. I doubt it: if the batchInterval is 1 minute, will a single RDD then contain all the data from the last minute?
NOTE: I'm not comparing a batch to an RDD in general; I'm asking about the batch as Spark processes it internally.
Upvotes: 0
Views: 302
Reputation: 11
Let me quote the Spark Streaming guide:
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the following figure.
As you can see, a single batch is a single RDD. This is why adjusting the batch interval to your data flow is crucial for the stability of your application.
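To make the "one batch = one RDD" point concrete, here is a rough toy model in plain Python. It is not Spark code and not how Spark is implemented. The function `discretize`, the sample records and the chosen intervals are all made up for illustration. It assumes a receiver-based stream, where (as far as I know) received data is additionally cut into blocks every `spark.streaming.blockInterval` and each block becomes one partition of the batch's RDD. So a 1-minute batch does end up in one RDD, but that RDD is still split into partitions.

```python
from collections import defaultdict


def discretize(records, batch_interval_s, block_interval_s):
    """Toy model: group (timestamp_s, value) records into batches and blocks.

    Returns {batch_index: {block_index: [values]}}. Each batch stands in for
    one RDD and each block for one partition of that RDD (an assumption that
    mirrors receiver-based streams, not Spark's actual code).
    """
    batches = defaultdict(lambda: defaultdict(list))
    for ts, value in records:
        # Which batch interval the record falls into.
        batch_idx = ts // batch_interval_s
        # Which block inside that batch interval the record falls into.
        block_idx = (ts % batch_interval_s) // block_interval_s
        batches[batch_idx][block_idx].append(value)
    return {b: dict(blocks) for b, blocks in batches.items()}


# Hypothetical input: one record every 10 seconds for two minutes.
records = [(t, f"event-{t}") for t in range(0, 120, 10)]

# Batch interval of 60 s; the 20 s block interval is exaggerated for readability.
batches = discretize(records, batch_interval_s=60, block_interval_s=20)

for batch_idx, blocks in sorted(batches.items()):
    total = sum(len(values) for values in blocks.values())
    print(f"batch {batch_idx}: 1 RDD, {len(blocks)} partitions, {total} records")
```

With this input, every record from the first minute lands in batch 0 and every record from the second minute in batch 1, which is the behaviour the guide describes: all data received during one batch interval ends up in the same RDD.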
Upvotes: 1