Reputation: 239
Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? My first preference is Python, then Scala and Java.
The courses of action, in order of preference, are:

1. Using a thread pool: run different functions doing different aggregations on different threads. I have not seen an example that does this.
2. Using cluster mode on YARN, submitting different jars. Is this possible, and if yes, is it possible in PySpark?
3. Using Kafka: run different spark-submits against the DataFrame streaming through Kafka.
I am quite new to Spark, and my experience so far is running Spark on YARN for ETL, performing multiple aggregations serially. I was wondering whether these aggregations could be run in parallel, since they are mostly independent.
Upvotes: 0
Views: 690
Reputation: 40360
Considering your broad question, here is a broad answer:
Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
As for the rest, it isn't clear what you are asking.
Upvotes: 0