Reputation: 7583
Let's say I have an RDD[Int]. After a groupBy with some discriminator function, I am left with an RDD[(Int, Iterable[Int])].
Since this Iterable can be large, it ought to be distributed among nodes, but there is no way to work with it the way you would with an RDD.
For example, I might want to do some further pairing and aggregating by key within one of the Iterables, or sort one of them and find the median.
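For instance, the only approach I can think of looks like this (just a sketch of what I mean; discriminator stands for whatever grouping function I'm using):

val grouped: RDD[(Int, Iterable[Int])] = rdd.groupBy(discriminator)
val medians: RDD[(Int, Int)] = grouped.mapValues { values =>
  val sorted = values.toSeq.sorted // pulls the whole group into one place?
  sorted(sorted.size / 2)          // middle element as the median
}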
I suppose it is not legal to call .toList or .toSeq there, because the regular Scala collections are not distributed.
So what is the right approach for dealing with Iterables?
Upvotes: 0
Views: 868
Reputation: 30310
You almost certainly don't want to do a groupBy. One of the biggest performance issues in Spark jobs is the shuffling of data around the cluster caused by poor partitioning and data locality. If you're doing a groupBy, presumably you want to partition your data on that key and keep the data for each key as close together as possible. So in the end, a groupBy suggests you actually don't want your data distributed away from its partition if you can avoid it.
But if you do want the work to stay distributed, you probably want to do something like this:
val rdd: RDD[Int] = ...
val rdd2: RDD[(Int, Int)] = rdd.map(i => (key(i), i))   // pair each element with its key
val rdd3: RDD[(Int, Int)] = rdd2.reduceByKey((accumulator, i) => myFunction(accumulator, i))   // fold the values per key without materializing a group
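If you genuinely need a full per-key sort, as for the median in the question, one sketch is to build up each key's values with aggregateByKey and sort once per key. This assumes any single key's group fits in one executor's memory, and the size-over-two median below is just illustrative:

import scala.collection.mutable.ArrayBuffer

val medians: RDD[(Int, Int)] = rdd2
  .aggregateByKey(ArrayBuffer.empty[Int])(
    (buf, i) => { buf += i; buf },   // add a value within a partition
    (b1, b2) => { b1 ++= b2; b1 }    // merge partial buffers across partitions
  )
  .mapValues { buf =>
    val sorted = buf.sorted
    sorted(sorted.size / 2)          // middle element as a simple median
  }

Note this still moves every value across the network, so the earlier advice stands: prefer a true reduction like reduceByKey whenever the per-key answer doesn't need all the values at once.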
Upvotes: 1
Reputation: 1846
You can use the aggregateByKey or reduceByKey transformations, and to materialize the result on the driver you can use actions like collect.
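For example (a sketch assuming an rdd: RDD[Int] as in the question; keyOf is a hypothetical stand-in for your discriminator function):

val pairs: RDD[(Int, Int)] = rdd.map(i => (keyOf(i), i)) // keyOf is hypothetical
val sums: RDD[(Int, Int)] = pairs.reduceByKey(_ + _)     // transformation: runs distributed
val result: Array[(Int, Int)] = sums.collect()           // action: brings results to the driver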
Upvotes: 0