Reputation: 462
In Spark Streaming, a batch is a batch of RDDs. Suppose a batch contains 3 RDDs.
The Spark documentation also says that the receiver creates a block every 200 ms, and a partition is allotted to each block.
Say in 1 second I have a batch of 3 RDDs, and 5 blocks if the 200 ms block interval is considered.
So how does an RDD get partitioned across the worker nodes? Is it the single RDD that gets partitioned, or the complete batch?
I may have understood this the wrong way. Please guide me.
Upvotes: 8
Views: 7030
Reputation: 191
Is this still applicable in newer versions of Spark?
I read an article saying that the scenario with multiple receivers is outdated, and that the new direct Kafka API (createDirectStream) takes care of pretty much everything for you instead.
Upvotes: 3
Reputation: 37435
One streaming batch corresponds to one RDD. That RDD will have n partitions, where n = batch interval / block interval. Let's say you have the standard 200ms block interval and a batch interval of 2 seconds; then you will have 10 partitions. Blocks are created by a receiver, and each receiver runs on a single host. So those 10 partitions live on a single node and are replicated to a second node.
When the RDD is submitted for processing, the executors running its tasks will read the data from the node holding the blocks. Tasks executing on that node will have "NODE_LOCAL" locality, while tasks executing on other nodes will have "ANY" locality and will take longer.
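As a quick sanity check of the arithmetic above, here's a minimal sketch (the 200 ms default comes from the `spark.streaming.blockInterval` setting; `partitions_per_batch` is just an illustrative helper, not a Spark API):

```python
# Partitions per streaming batch = batch interval / block interval.
# 200 ms is Spark Streaming's default spark.streaming.blockInterval.
def partitions_per_batch(batch_interval_ms, block_interval_ms=200):
    return batch_interval_ms // block_interval_ms

print(partitions_per_batch(2000))  # 2 s batch  -> 10 partitions
print(partitions_per_batch(1000))  # 1 s batch  -> 5 partitions
```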
Therefore, to improve parallel processing, it's recommended to allocate several receivers and union the resulting DStreams into a single DStream for further processing. That way data is consumed and processed by several nodes in parallel.
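A rough PySpark sketch of that pattern, assuming a running StreamingContext and socket receivers (the host/port pairs are placeholders; `build_unioned_stream` is a hypothetical helper, not part of the Spark API):

```python
# Sketch: one receiver per (host, port) source, then union them so
# downstream transformations see a single DStream whose batches carry
# the partitions produced by every receiver's blocks.
def build_unioned_stream(ssc, sources):
    # each socketTextStream call allocates its own receiver
    streams = [ssc.socketTextStream(host, port) for host, port in sources]
    # StreamingContext.union merges them into one DStream
    return ssc.union(*streams)
```

With, say, 4 receivers and a 2 s batch over a 200 ms block interval, each batch's RDD would carry roughly 4 x 10 = 40 partitions spread across the receiver hosts.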
Upvotes: 20