newbie

Reputation: 391

How to flatMap nested lists in Spark

I have an RDD in Spark that looks like this -

[Foo1, Bar[bar1,bar2]]

The Bar objects have a getList method, which returns the lists [bar11, bar12, bar13] and [bar21, bar22] respectively. I want the output to look like this -

[Foo1, [bar11, bar12, bar13, bar21, bar22]]

The approach that I am able to think of is something like this -

my_rdd.map(x => (x._1, x._2.getList))
    .flatMap {
        case (x, y) => y.map(b => (x, b))
    }

The first map operation returns Foo1 and all the lists, but I am not able to flatten them beyond that.
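For reference, the intended transformation can be sketched with plain Scala collections, no Spark required; Bar below is a hypothetical stand-in for the question's class, assuming each Bar carries one of the inner lists:

```scala
// Hypothetical stand-in for the question's Bar class.
case class Bar(items: List[String]) {
  def getList: List[String] = items
}

// Mirrors the element [Foo1, Bar[bar1, bar2]] from the question.
val pair = ("Foo1", List(Bar(List("bar11", "bar12", "bar13")),
                         Bar(List("bar21", "bar22"))))

// flatMap concatenates the per-Bar lists into one flat list.
val flattened = (pair._1, pair._2.flatMap(_.getList))
// flattened == ("Foo1", List("bar11", "bar12", "bar13", "bar21", "bar22"))
```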

Upvotes: 1

Views: 1460

Answers (2)

Tim

Reputation: 3725

You can do this with one line:

my_rdd.mapValues(_.flatMap(_.getList))

There is another answer that uses map instead of mapValues. While both produce the same RDD elements, it is worth getting into the habit of using the "minimal" function necessary with Spark RDDs, because you can pay a large performance cost for using map instead of mapValues without realizing it: map on an RDD strips the partitioner, if one exists, while mapValues preserves it.

If you have an RDD[(K, V)] and call rdd.groupByKey(), you'll end up with an RDD[(K, Array[V])] that is partitioned by K. If you want to join with another RDD by K, you've already done most of the work.

If you add a map in between the groupByKey() and the join, Spark will re-shuffle that RDD. This is very painful! mapValues is safe.
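The partitioner difference can be observed directly. This is a minimal sketch, assuming Spark is on the classpath and a local master is fine; the names here are just for illustration:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local-mode context purely for demonstration.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("partitioner-demo"))

// A pair RDD explicitly partitioned by key.
val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(2))

// mapValues keeps the partitioner (prints Some(...)).
println(byKey.mapValues(_ + 1).partitioner)

// map strips it (prints None), so a later join re-shuffles.
println(byKey.map { case (k, v) => (k, v + 1) }.partitioner)

sc.stop()
```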

Upvotes: 1

Salah LAARIDH

Reputation: 19

In your code, x._2.getList returns a list of lists. Use the flatten method as follows to get the expected result:

my_rdd.map(x => (x._1, x._2.getList.flatten))
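On plain collections (with hypothetical placeholder values standing in for what getList returns), flatten behaves like this:

```scala
// If getList returns a list of lists, flatten concatenates the inner lists.
val lists = List(List("bar11", "bar12", "bar13"), List("bar21", "bar22"))
val flat = lists.flatten
// flat == List("bar11", "bar12", "bar13", "bar21", "bar22")
```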

Upvotes: 1
