Reputation: 462
In Spark Streaming, a batch is a batch of RDDs. Suppose a batch contains 3 RDDs.
The Spark documentation also says that the receiver creates a block every 200 ms, and a partition is allotted to each block.
Say in 1 second I have a batch of 3 RDDs, and 5 blocks if the 200 ms block interval is considered.
So how does an RDD get partitioned across the worker nodes? Is it the single RDD that gets partitioned, or the complete batch?
I may have understood this the wrong way. Please guide me.
Upvotes: 8
Views: 7030
Reputation: 191
Is this still applicable in newer versions of Spark?
I read an article saying that the scenario with multiple receivers is outdated, and that the new direct Kafka API (createDirectStream) takes care of pretty much everything for you instead.
Upvotes: 3
Reputation: 37435
One streaming batch corresponds to one RDD. That RDD will have n partitions, where n = batch interval / block interval. Let's say you have the standard 200ms block interval and a batch interval of 2 seconds; then you will have 10 partitions. Blocks are created by a receiver, and each receiver runs on a single host. So those 10 partitions live on a single node and are replicated to a second node.
When the RDD is submitted for processing, the executors running its tasks will read the data from the node holding the blocks. Tasks executing on that node will have "NODE_LOCAL" locality, while tasks executing on other nodes will have "ANY" locality and will take longer.
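As a quick sanity check of the arithmetic above, here's a minimal sketch (the 200 ms default comes from the `spark.streaming.blockInterval` setting; `partitions_per_batch` is just an illustrative helper, not a Spark API):

```python
# Partitions per streaming batch = batch interval / block interval.
# 200 ms is Spark Streaming's default spark.streaming.blockInterval.
def partitions_per_batch(batch_interval_ms, block_interval_ms=200):
    return batch_interval_ms // block_interval_ms

print(partitions_per_batch(2000))  # 2 s batch  -> 10 partitions
print(partitions_per_batch(1000))  # 1 s batch  -> 5 partitions
```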
Therefore, to improve parallel processing, it's recommended to allocate several receivers and union the resulting DStreams into a single DStream for further processing. That way data is consumed and processed by several nodes in parallel.
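A rough PySpark sketch of that pattern, assuming a running StreamingContext and socket receivers (the host/port pairs are placeholders; `build_unioned_stream` is a hypothetical helper, not part of the Spark API):

```python
# Sketch: one receiver per (host, port) source, then union them so
# downstream transformations see a single DStream whose batches carry
# the partitions produced by every receiver's blocks.
def build_unioned_stream(ssc, sources):
    # each socketTextStream call allocates its own receiver
    streams = [ssc.socketTextStream(host, port) for host, port in sources]
    # StreamingContext.union merges them into one DStream
    return ssc.union(*streams)
```

With, say, 4 receivers and a 2 s batch over a 200 ms block interval, each batch's RDD would carry roughly 4 x 10 = 40 partitions spread across the receiver hosts.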
Upvotes: 20