chrislovecnm

Reputation: 2621

Apache Spark: Reduce number of cores during an execution

Is there a way to reduce the number of cores/executors during a certain part of the run? We don't want to overwhelm the destination datastore, but we need more cores to do the computational work effectively.

Basically:

// want n cores here
val eventJsonRdd: RDD[(String,(Event, Option[Article]))] = eventGeoRdd.leftOuterJoin(articlesRdd)

val toSave = eventJsonRdd.map(processEventsAndArticlesJson)

// want two cores here
toSave.saveToEs("apollobit/events")

Upvotes: 0

Views: 145

Answers (1)

Sean Owen

Reputation: 66891

You can try:

toSave.repartition(2).saveToEs("apollobit/events")

Note that this will entail a potentially expensive shuffle.
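For comparison, a quick sketch of both options: repartition(2) shuffles all the data into exactly two partitions, while Spark's coalesce(2) merges existing partitions without a shuffle, at the risk of unevenly sized partitions:

// full shuffle down to 2 partitions, evenly distributed
toSave.repartition(2).saveToEs("apollobit/events")

// no shuffle: merges existing partitions, but sizes may be uneven
toSave.coalesce(2).saveToEs("apollobit/events")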

If your store supports bulk updates, you will get much better performance by calling foreachPartition and writing a chunk of data at a time rather than one record at a time.
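A minimal sketch of that pattern, assuming a hypothetical StoreClient with a bulk-write API (the real client, index names, and batch size depend on your store):

toSave.foreachPartition { partition =>
  // hypothetical client for the target store, created once per partition on the executor
  val client = StoreClient.connect("apollobit")
  // send records in chunks of 500 instead of one round trip per record
  partition.grouped(500).foreach { batch =>
    client.bulkWrite("events", batch)
  }
  client.close()
}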

Upvotes: 2
