Reputation: 8967
I have gone through this Stack Overflow question; as per the answer, it creates a DStream with only one RDD per batch interval.
For example:
My batch interval is 1 minute, and my Spark Streaming job is consuming data from a Kafka topic.
My question is: does the RDD available in the DStream pull/contain the entire data for the last one minute? Are there any criteria or options we need to set to pull all the data created in the last one minute?
If I have a Kafka topic with 3 partitions, and all 3 partitions contain data for the last one minute, will the DStream pull/contain all the data created in the last one minute across all the Kafka topic partitions?
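For reference, this is roughly how the job is set up (the ZooKeeper address, group id and topic name below are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaMinuteBatches")
val ssc = new StreamingContext(conf, Minutes(1)) // 1-minute batch interval

// Receiver-based stream; "my-topic" has 3 partitions, so 3 consumer threads
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 3))

stream.map(_._2).count().print() // record count per 1-minute batch

ssc.start()
ssc.awaitTermination()
```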
Update:
In which case does a DStream contain more than one RDD?
Upvotes: 3
Views: 900
Reputation: 149628
One important thing that was overlooked is the fact that Kafka has multiple integration implementations for Spark Streaming.
One is the receiver-based approach, which sets up a receiver on a selected worker node that reads the data, buffers it and then distributes it.
The other is the receiver-less (direct) approach, which is quite different. The node running the driver consumes only offsets; when it distributes the tasks, it sends each executor a range of offsets to read and process. This way there is no buffering (hence, receiver-less), and each range of offsets is consumed by mutually exclusive executor processes running on the workers.
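As a hedged sketch of the direct approach (assuming the spark-streaming-kafka-0-8 artifact, an existing StreamingContext `ssc`, and placeholder broker/topic names), you can even print the offset range each executor will read:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))

directStream.foreachRDD { rdd =>
  // One RDD partition per Kafka partition; each carries a (fromOffset, untilOffset) range
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}
```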
will the DStream pull/contain all the data created for the last one minute in all the Kafka topic partitions?
In both approaches, it will. Once the one-minute interval hits, Spark will attempt to read the data from Kafka and spread it across the cluster for processing.
In which case does a DStream contain more than one RDD?
As others have said, it never does. A DStream hands you exactly one RDD each time a batch interval elapses.
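To see this, note that `foreachRDD` fires exactly once per batch interval, with that interval's single RDD as its argument (a minimal sketch, assuming the `stream` variable from the question's setup):

```scala
stream.foreachRDD { (rdd, time) =>
  // Called once per batch interval; `rdd` is the single RDD for that interval
  println(s"Batch at $time: ${rdd.partitions.length} partitions, ${rdd.count()} records")
}
```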
Upvotes: 1
Reputation: 74769
A Spark Streaming DStream consumes data from a Kafka topic that is partitioned, say into 3 partitions on 3 different Kafka brokers.
Does the RDD available in the DStream pull/contain the entire data for the last one minute?
Not quite. The RDD only describes the offsets to read data from when tasks are submitted for execution. It is just like any other RDD in Spark: essentially a description of what to do and where to find the data to work on once its tasks are submitted.
If, however, you use "pulls/contains" more loosely, to express that at some point the records (from the partitions at the given offsets) are going to be processed, then yes, you're right: the entire minute is mapped to offsets, and the offsets are in turn mapped to the records that Kafka hands over for processing.
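A minimal sketch of the distinction, assuming the receiver-less (direct) stream from the other answer: no records are fetched from Kafka until an action runs.

```scala
directStream.foreachRDD { rdd =>
  // `map` only builds a plan; nothing is read from Kafka yet
  val lengths = rdd.map { case (_, value) => value.length }
  // The action submits tasks, which actually fetch their offset ranges from Kafka
  val total = lengths.fold(0)(_ + _)
  println(s"Characters processed in this batch: $total")
}
```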
in all the Kafka topic partitions?
Yes, though it is Kafka, not necessarily Spark Streaming / DStream / RDD, that handles this. A DStream's RDDs request records from the topic(s) and their partitions by offset, from the last offsets queried up to now.
The "minute" for Spark Streaming might be slightly different from Kafka's point of view, since a DStream's RDDs contain records for offset ranges, not records per time window.
In which case does a DStream contain more than one RDD?
Never.
Upvotes: 3
Reputation: 2406
I recommend reading more about the DStream abstraction in the Spark documentation:
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data [...]. Internally, a DStream is represented by a continuous series of RDDs.
I would add one point to that: don't forget that an RDD is itself another layer of abstraction, so it can be divided into smaller chunks (partitions) and spread across the cluster.
Considering your questions: a DStream gives you a single RDD per batch interval, but that RDD is itself partitioned, so it can still cover the data from all of the Kafka topic's partitions for that interval.
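As a sketch (assuming a `stream` DStream and a hypothetical per-record `process` function), the single batch RDD is processed in parallel, one task per partition:

```scala
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Runs on an executor, once per RDD partition, in parallel across the cluster
    records.foreach(record => process(record)) // `process` is a placeholder for your logic
  }
}
```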
Upvotes: 2