Reputation: 11
I am new to Spark and trying out a sample Spark–Kafka integration. I have posted the following JSON messages to a single-partition Kafka topic:
{"city":"LosAngeles","country":"US"}
{"city":"MexicoCity","country":"Mexico"}
{"city":"London","country":"UK"}
{"city":"NewYork","country":"US"}
{"city":"NewCastle","country":"UK"}
I am doing the following steps:
1. I have a DStream receiver in the Spark job.
2. I do a repartition of 3 on that DStream.
3. I execute a flatMap on the repartitioned DStream to produce key/value pairs of partitions in the transformed DStream.
4. I do a groupByKey shuffle.
5. I apply another map transformation on the stream from step 4 to add some more keys to each value JSON.
6. Finally, I do a foreachPartition and print out the records of each partition.
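For reference, the steps above can be sketched roughly as follows. This is a hypothetical reconstruction, not my actual job: the topic name, ZooKeeper address, and the `parseToPairs`/`addExtraKeys` helpers are placeholders for my real logic.

```scala
// Sketch of the pipeline described above (receiver-based Kafka API assumed;
// "cities", "localhost:2181", parseToPairs and addExtraKeys are placeholders).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaRepartitionDemo")
val ssc = new StreamingContext(conf, Seconds(10))

// Step 1: single receiver reading from a single-partition topic
val stream = KafkaUtils.createStream(ssc, "localhost:2181", "demo-group", Map("cities" -> 1))

// Step 2: repartition the received records into 3 partitions
val repartitioned = stream.map(_._2).repartition(3)

// Step 3: flatMap each JSON into key/value pairs (parsing logic elided)
val pairs = repartitioned.flatMap(json => parseToPairs(json))

// Step 4: shuffle by key
val grouped = pairs.groupByKey()

// Step 5: enrich each value JSON with additional keys (logic elided)
val enriched = grouped.map { case (key, values) => (key, values.map(addExtraKeys)) }

// Step 6: act on each partition and print its records
enriched.foreachRDD { rdd =>
  rdd.foreachPartition { iter => iter.foreach(println) }
}

ssc.start()
ssc.awaitTermination()
```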
I am running this in Spark standalone mode with 3 executors in the cluster. Since I am receiving from a single Kafka partition, I won't be able to have parallel receivers for parallel execution of the DStream. Correct me if I am wrong, but since I repartition to 3 after receiving, I expect three partitions to be created, and the subsequent map transformations on those partitions to execute in parallel across the 3 executors.

What I am observing instead is that all my partitions are executed sequentially on a single executor, while the other two executors are not used at all. Can I get some suggestions on this, please?
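To confirm the repartition itself is taking effect, I added a quick diagnostic of this shape (a sketch; `repartitioned` stands for the DStream after the repartition step):

```scala
// Print the partition count of each batch RDD after the repartition,
// to verify that 3 partitions are actually being created.
repartitioned.foreachRDD { rdd =>
  println(s"partitions in this batch: ${rdd.getNumPartitions}")
}
```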
How can I execute the RDD partitions on parallel executors when they come from a single DStream receiver on a single-partition Kafka topic?
Upvotes: 1
Views: 58