gersh

Reputation: 2437

How do you get Apache Spark to start reducing before the map finishes, to cut memory usage?

I'm doing a map-reduce job with Apache Spark, but the mapping step produces a structure that uses up a lot of memory. How can I get Spark to reduce the mapped objects and delete them from memory before more mapped objects accumulate?

I'm basically doing myrdd.map(f).reduce(r). However, f returns a very big object, so I need the reducer to run and then delete the mapped objects from memory before too many of them pile up. Can I do this somehow?

Upvotes: 0

Views: 68

Answers (2)

simpadjo

Reputation: 4017
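You can express the same computation with rdd.aggregate, which makes the map-then-fold explicit: each element is mapped and immediately folded into a running per-partition accumulator, so the mapped values never need to exist all at once: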

import org.apache.spark.rdd.RDD

trait SmallThing
trait BigThing

val mapFunction: SmallThing => BigThing = ???
val reduceFunction: (BigThing, BigThing) => BigThing = ???

val rdd: RDD[SmallThing] = ???

// Initial implementation: separate map and reduce steps.
val result1: BigThing = rdd.map(mapFunction).reduce(reduceFunction)

// Equivalent implementation: aggregate fuses the map into the fold.
// seqOp maps each element and folds it straight into the accumulator;
// combOp merges the per-partition accumulators.
val emptyBigThing: BigThing = ???
val result2: BigThing = rdd.aggregate(emptyBigThing)(
  seqOp = (agg, small) => reduceFunction(agg, mapFunction(small)),
  combOp = reduceFunction
)

Upvotes: 0

wandermonk

Reputation: 7356

Similar to the combiner in MapReduce: when you are working with key/value pairs, the combineByKey() interface lets you customize the combiner behaviour. Methods like reduceByKey() use their own combiner by default to combine the data locally within each partition for a given key.

Similar to aggregate() (which is used on single-element RDDs), combineByKey() allows you to return an RDD element type that differs from the element type of the input RDD.
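As an illustration, here is a minimal sketch of combineByKey() computing a per-key average; the pair RDD, key and value types here are assumptions for the example, not taken from the question:

import org.apache.spark.rdd.RDD

// Assumed input: a pair RDD of (key, measurement).
val pairs: RDD[(String, Double)] = ???

// The accumulator type (sum, count) differs from the value type Double,
// which is exactly what combineByKey allows.
val sumCounts: RDD[(String, (Double, Long))] = pairs.combineByKey(
  (v: Double) => (v, 1L),                   // createCombiner: first value for a key in a partition
  (acc: (Double, Long), v: Double) =>       // mergeValue: fold a value in locally (the "combiner" step)
    (acc._1 + v, acc._2 + 1L),
  (a: (Double, Long), b: (Double, Long)) => // mergeCombiners: merge accumulators across partitions
    (a._1 + b._1, a._2 + b._2)
)

val averages: RDD[(String, Double)] =
  sumCounts.mapValues { case (sum, count) => sum / count }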

Upvotes: 0
