sunnydev

Reputation: 11

Parallelizing RDD partitions across Spark executors

I am new to Spark and trying out a sample Spark Kafka integration. What I have done is post the following JSON messages to a single-partition Kafka topic:

{"city":"LosAngeles","country":"US"}
{"city":"MexicoCity","country":"Mexico"}
{"city":"London","country":"UK"}
{"city":"NewYork","country":"US"}
{"city":"NewCastle","country":"UK"}

I am doing the following steps (a rough sketch of the whole pipeline is shown after this list):

  1. I have a DStream receiver in the Spark job.

  2. I do a repartition of 3 on that DStream.

  3. Then I execute a flatMap on this DStream to produce key/value pairs in transformedRDDStream.

  4. I do a groupByKey, which shuffles the data.

  5. Then I apply another map transformation to the stream from step 4 to add some more keys to each value JSON.

  6. Finally I do a foreachPartition over the RDDs and print the result.
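
The pipeline looks roughly like this (a minimal sketch, not my exact code: the topic name, ZooKeeper address, and the helpers extractCountry / addExtraKeys are illustrative stand-ins for my real JSON handling):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Illustrative helpers (stand-ins for my real JSON handling)
    def extractCountry(json: String): Option[String] =
      "\"country\":\"([^\"]+)\"".r.findFirstMatchIn(json).map(_.group(1))
    def addExtraKeys(json: String): String =
      json.stripSuffix("}") + ",\"processed\":\"true\"}"

    val conf = new SparkConf().setAppName("KafkaCityStream")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // 1. Receiver-based DStream from the single-partition topic
    val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "city-group", Map("cities" -> 1))

    // 2. Repartition to 3 so the later stages can, in theory, use all executors
    val repartitioned = kafkaStream.repartition(3)

    // 3. flatMap into key/value pairs, e.g. (country, json)
    val transformedRDDStream = repartitioned.flatMap { case (_, json) =>
      extractCountry(json).map(country => (country, json))
    }

    // 4. Shuffle by key
    val grouped = transformedRDDStream.groupByKey()

    // 5. Add some more keys to each value JSON
    val enriched = grouped.map { case (country, jsons) =>
      (country, jsons.map(addExtraKeys))
    }

    // 6. Print the result per partition
    enriched.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        iter.foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()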

I am running this in Spark standalone mode with 3 executors in the cluster. Since I am receiving from a single Kafka partition, I won't be able to have parallel receivers for parallel execution of DStreams. Correct me if I am wrong, but since I repartition to 3 after receiving, I expect three partitions to be created and the subsequent map transformation on those partitions to execute in parallel on the 3 executors. What I am observing instead is that all my partitions are processed in sequence on a single executor, and the other two executors are not being used. Can I get some suggestions on this, please?
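
For reference, the distribution can be observed with something like the following (a minimal diagnostic sketch; enriched is the stream from the pipeline sketch above, and the log format is just illustrative):

    import java.net.InetAddress
    import org.apache.spark.TaskContext

    // Print which host processes each partition of each micro-batch.
    enriched.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val host = InetAddress.getLocalHost.getHostName
        println(s"partition ${TaskContext.getPartitionId()} on host $host, records: ${iter.size}")
      }
    }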

How can I get the RDD partitions to execute in parallel across executors when they come from a single DStream receiver reading a single-partition Kafka topic?

Upvotes: 1

Views: 58

Answers (0)
