preitam ojha

Reputation: 239

Is it possible to run multiple aggregation jobs on a single DataFrame in parallel in Spark?

Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? First preference is Python, then Scala and Java.

The options I am considering, in order of preference, are:

  1. Using a ThreadPool: run different functions doing different aggregations on different threads. I have not seen an example that does this.

  2. Using cluster mode on YARN, submitting different jars. Is this possible, and if so, is it possible in PySpark?

  3. Using Kafka: run different spark-submit jobs on the DataFrame streaming through Kafka.

I am quite new to Spark, and my experience so far is limited to running Spark on YARN for ETL, doing multiple aggregations serially. I was wondering whether it is possible to run these aggregations in parallel, since they are mostly independent.

Upvotes: 0

Views: 690

Answers (1)

eliasah

Reputation: 40360

Considering your broad question, here is a broad answer:

Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
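For instance, a minimal PySpark sketch of option 1 from your list (submitting independent aggregations from separate threads) could look like the following. The column names, the sample data, and the mention of the FAIR scheduler are illustrative assumptions, not taken from your post:

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parallel-aggregations").getOrCreate()

# Hypothetical data; replace with your own DataFrame.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)],
    ["category", "amount"],
)
df.cache()  # avoid recomputing the source for every aggregation

def sum_by_category():
    # collect() is an action, so it triggers its own Spark job
    return df.groupBy("category").agg(F.sum("amount").alias("total")).collect()

def global_average():
    return df.agg(F.avg("amount").alias("avg_amount")).collect()

# Each function is submitted from its own thread, so the resulting jobs
# can be scheduled concurrently (e.g. with spark.scheduler.mode=FAIR and
# enough executor capacity).
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(sum_by_category), pool.submit(global_average)]
    results = [f.result() for f in futures]

print(results)
```

Each action launches a separate Spark job; because the jobs are submitted from different threads, the driver does not have to wait for one aggregation to finish before starting the next, and caching the shared DataFrame keeps them from recomputing it.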

As for the rest, it is not clear what you are asking.

Upvotes: 0
