Reputation: 391
I have an RDD in Spark which looks like this -
[Foo1, Bar[bar1,bar2]]
Each Bar object has a getList method, which returns the lists [bar11, bar12, bar13] and [bar21, bar22] respectively. I want the output to look like this -
[Foo1, [bar11, bar12, bar13, bar21, bar22]]
The approach that I have been able to think of is something like this -
my_rdd.map(x => (x._1, x._2.getList))
  .flatMap {
    case (x, y) => y.map((x, _))
  }
The first map operation returns Foo1 and all of the lists, but I am not able to flatten them beyond that.
Upvotes: 1
Views: 1460
Reputation: 3725
You can do this with one line:
my_rdd.mapValues(_.flatMap(_.getList))
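As a plain-Scala illustration (outside Spark, using a hypothetical Bar case class standing in for the asker's type), the per-element transformation that mapValues(_.flatMap(_.getList)) applies looks like this:

```scala
// Hypothetical Bar case class standing in for the asker's type
// (assumption: each Bar's getList returns a List[String]).
case class Bar(getList: List[String])

// A plain-Scala stand-in for one (key, Seq[Bar]) element of the RDD.
val pairs = List(
  ("Foo1", List(Bar(List("bar11", "bar12", "bar13")),
                Bar(List("bar21", "bar22"))))
)

// The same per-element transformation that mapValues(_.flatMap(_.getList))
// applies across the RDD: flatMap over the Bars, concatenating their lists.
val flattened = pairs.map { case (k, bars) => (k, bars.flatMap(_.getList)) }

println(flattened)
// prints: List((Foo1,List(bar11, bar12, bar13, bar21, bar22)))
```

On an actual pair RDD, mapValues performs this same value-side transformation for every element while leaving the keys (and, as discussed below, the partitioner) untouched.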
There is another answer which uses map instead of mapValues. While this produces the same RDD elements, I think it's important to get in the practice of using the "minimal" function necessary with Spark RDDs, because you can actually pay a pretty huge performance cost for using map instead of mapValues without realizing it: the map function on an RDD strips the partitioner, if one exists, and mapValues does not.
If you have an RDD[(K, V)] and call rdd.groupByKey(), you'll end up with an RDD[(K, Iterable[V])] that is partitioned by K. If you want to join with another RDD by K, you've already done most of the work. If you add a map in between the groupByKey() and the join, Spark will re-shuffle that RDD. This is very painful! mapValues is safe.
Upvotes: 1
Reputation: 19
In your code, x._2.getList returns a list of lists. Use the flatten method as follows to get the expected result:
my_rdd.map(x => (x._1, x._2.getList.flatten))
Upvotes: 1