Reputation: 2437
I'm running a map-reduce job with Apache Spark, but each mapping step produces a structure that uses up a lot of memory. How can I get Spark to reduce the mapped objects and free them from memory before too many of them accumulate?
I'm basically doing myrdd.map(f).reduce(r). However, f returns a very big object, so I need the reducer to run and release each mapped object from memory before too many of them pile up. Can I do this somehow?
Upvotes: 0
Views: 68
Reputation: 4017
import org.apache.spark.rdd.RDD

trait SmallThing
trait BigThing

val mapFunction: SmallThing => BigThing = ???
val reduceFunction: (BigThing, BigThing) => BigThing = ???
val rdd: RDD[SmallThing] = ???

//initial implementation:
val result1: BigThing = rdd.map(mapFunction).reduce(reduceFunction)

//equivalent implementation: map and reduce fused into a single fold
val emptyBigThing: BigThing = ??? //zero value: a neutral element for reduceFunction
val result2: BigThing = rdd.aggregate(emptyBigThing)(
  seqOp = (agg, small) => reduceFunction(agg, mapFunction(small)),
  combOp = reduceFunction
)
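A note on how the second form behaves (my reading of the code above, not a claim made in the original answer): aggregate folds each partition element by element into a single BigThing accumulator, so every SmallThing is mapped and immediately merged via reduceFunction, and the mapped object does not need to be retained beyond that fold step.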
Upvotes: 0
Reputation: 7356
Similar to the combiner in MapReduce: when working with key/value pairs, the combineByKey() interface can be used to customize the combiner functionality. Methods like reduceByKey() use their own combiner by default to combine the data locally within each partition for a given key.
Similar to aggregate() (which aggregates a whole RDD down to a single value), combineByKey() allows the user to return an RDD element type that differs from the element type of the input RDD.
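To make that concrete, here is a minimal sketch of combineByKey() (the String/Int pair type and the Stats case class are my own illustration, not from the answer): the combiner type Stats differs from the input value type Int, which is exactly the flexibility described above.

import org.apache.spark.rdd.RDD

// Illustrative combiner type: differs from the input value type (Int).
case class Stats(count: Long, sum: Long)

val pairs: RDD[(String, Int)] = ??? // hypothetical input RDD of key/value pairs

val statsByKey: RDD[(String, Stats)] = pairs.combineByKey(
  // createCombiner: called for the first value of a key seen in a partition
  (v: Int) => Stats(1L, v.toLong),
  // mergeValue: fold another value from the same partition into the local combiner
  (acc: Stats, v: Int) => Stats(acc.count + 1, acc.sum + v),
  // mergeCombiners: merge per-partition combiners for the same key
  (a: Stats, b: Stats) => Stats(a.count + b.count, a.sum + b.sum)
)

Like reduceByKey(), combineByKey() merges values locally within each partition before the shuffle, which is the combiner behavior described above.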
Upvotes: 0