Gridou

Reputation: 109

spark reduceByKey performance/complexity when reducing lists with scala

I need to perform a reduceByKey on lists. What would be the fastest solution? I'm using the ::: operator to merge two lists in the reduce operation, but ::: is O(n), so I'm afraid the whole reduce will end up being O(n²).

Code example:

val rdd: RDD[(Int, List[Int])] = getMyRDD()
rdd.reduceByKey(_ ::: _)

What would be the best/most efficient solution?

Upvotes: 1

Views: 561

Answers (1)

zero323

Reputation: 330073

The best you can do is:

rdd.groupByKey.mapValues(_.flatten.toList)

This will:

  • Skip the map-side reduce, which is pointless here. It requires a marginally larger shuffle but significantly reduces GC time.
  • Use a mutable buffer with amortized constant-time appends for the intermediate aggregations.
  • Flatten the intermediate aggregate in O(N) time (a runnable sketch of the whole pipeline follows this list).
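
A minimal, self-contained sketch of that approach, assuming a local SparkContext; the sample data below is illustrative, not from the question:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("merge-lists").setMaster("local[*]"))

// Illustrative data: each key appears several times, each time with a small list
val rdd: RDD[(Int, List[Int])] = sc.parallelize(Seq(
  (1, List(1, 2)), (1, List(3)), (2, List(4, 5)), (2, List(6))
))

// Group all lists for a key, then flatten them once into a single List per key
val merged: RDD[(Int, List[Int])] = rdd.groupByKey.mapValues(_.flatten.toList)

merged.collect().foreach(println)
// e.g. (1,List(1, 2, 3)) and (2,List(4, 5, 6)) (ordering may vary)

sc.stop()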

If you want map-side reduction, you can use aggregateByKey:

import scala.collection.mutable.ArrayBuffer

rdd.aggregateByKey(ArrayBuffer[Int]())(_ ++= _, _ ++= _).mapValues(_.toList)

but it will usually be significantly more expensive than the first solution.
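
For reference, a sketch of the same aggregateByKey call with the two merge functions written out and commented, reusing the hypothetical rdd from the sketch above (the lambdas are just the _ ++= _ shorthands spelled out):

import scala.collection.mutable.ArrayBuffer

// seqOp: map side, appends each incoming List[Int] to the per-key buffer
// combOp: reduce side, concatenates buffers coming from different partitions
val mergedAgg = rdd.aggregateByKey(ArrayBuffer.empty[Int])(
  (buf, xs) => buf ++= xs,
  (b1, b2) => b1 ++= b2
).mapValues(_.toList)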

Upvotes: 2
