Reputation: 7583
Let's say I have an RDD[Int]. After a groupBy with some discriminator function, I am left with an RDD[(Int, Iterable[Int])].
Since this Iterable can be large, it ought to be distributed among nodes, but there is no way to work with it the way you would with an RDD.
For example, I might want to do some further pairing and aggregating by key within one of the Iterables, or sort one of them and find the median.
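For instance, the only approach I can think of looks like this (just a sketch of what I mean; discriminator stands for whatever grouping function I'm using):

val grouped: RDD[(Int, Iterable[Int])] = rdd.groupBy(discriminator)
val medians: RDD[(Int, Int)] = grouped.mapValues { values =>
  val sorted = values.toSeq.sorted // pulls the whole group into one place?
  sorted(sorted.size / 2)          // middle element as the median
}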
I suppose it is not legal to call .toList or .toSeq there, because the regular Scala collections are not distributed.
So what is the right approach for dealing with Iterables?
Upvotes: 0
Views: 868
Reputation: 30310
You almost certainly don't want to do a groupBy. One of the biggest performance issues in Spark jobs is the shuffling of data around the cluster caused by poor partitioning and data locality. If you're doing a groupBy, presumably you want to partition your data on that key and keep the data for each key as close together as possible. So in the end, a groupBy suggests you actually don't want your data distributed away from its partition if you can avoid it.
But if you do want the work to stay distributed, you probably want to do something like this:
val rdd: RDD[Int] = ...
val rdd2: RDD[(Int, Int)] = rdd.map(i => (key(i), i))   // pair each element with its key
val rdd3: RDD[(Int, Int)] = rdd2.reduceByKey((accumulator, i) => myFunction(accumulator, i))   // fold the values per key without materializing a group
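If you genuinely need a full per-key sort, as for the median in the question, one sketch is to build up each key's values with aggregateByKey and sort once per key. This assumes any single key's group fits in one executor's memory, and the size-over-two median below is just illustrative:

import scala.collection.mutable.ArrayBuffer

val medians: RDD[(Int, Int)] = rdd2
  .aggregateByKey(ArrayBuffer.empty[Int])(
    (buf, i) => { buf += i; buf },   // add a value within a partition
    (b1, b2) => { b1 ++= b2; b1 }    // merge partial buffers across partitions
  )
  .mapValues { buf =>
    val sorted = buf.sorted
    sorted(sorted.size / 2)          // middle element as a simple median
  }

Note this still moves every value across the network, so the earlier advice stands: prefer a true reduction like reduceByKey whenever the per-key answer doesn't need all the values at once.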
Upvotes: 1
Reputation: 1846
You can use the aggregateByKey or reduceByKey transformations, and to materialize the result on the driver you can use actions like collect.
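For example (a sketch assuming an rdd: RDD[Int] as in the question; keyOf is a hypothetical stand-in for your discriminator function):

val pairs: RDD[(Int, Int)] = rdd.map(i => (keyOf(i), i)) // keyOf is hypothetical
val sums: RDD[(Int, Int)] = pairs.reduceByKey(_ + _)     // transformation: runs distributed
val result: Array[(Int, Int)] = sums.collect()           // action: brings results to the driver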
Upvotes: 0