newbie

Reputation: 391

How to flatten a list in a Spark RDD?

I have an RDD[(Long, util.List[Foo])] which I want to flatten over the list so it looks like RDD[(Long, Foo)], and then eventually call the getCode method that is part of Foo. Here is my approach so far:

val test = source
  .filter(x => x.getFooList != null)
  .map(x => (x.getFooList, x.getId))
  .map {
    // toArray() returns Array[AnyRef], so the element type is lost here
    case (foo, id) => foo.toArray().map(foo => (foo, id))
  }

Ideally I would like to have the id in the first position.

This method works. However, the toArray method converts the elements from Foo to AnyRef, and I cannot call the getCode method on AnyRef. What is the best way to do this?

Upvotes: 2

Views: 1374

Answers (2)

chengpohi

Reputation: 14227

Converting the util.List to a Scala collection solves this issue:

import scala.collection.JavaConverters._
...
case (foo, id) => foo.asScala.map(foo => (foo, id))
...
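For context, dropped into the question's pipeline this would look something like the following (a minimal sketch, assuming the same source and Foo from the question):

import scala.collection.JavaConverters._

val test = source
  .filter(x => x.getFooList != null)
  .map(x => (x.getFooList, x.getId))
  .map {
    // asScala wraps the java.util.List without losing the element type,
    // so each element is still a Foo and getCode stays callable
    case (foos, id) => foos.asScala.map(foo => (foo, id))
  }

Note this keeps the map from the question, so each record still carries a nested collection of (Foo, Long) pairs; the answer below shows how to flatten it with flatMap.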

Upvotes: 1

Tzach Zohar

Reputation: 37852

First, if you want to flatten your RDD you'll have to use flatMap and not map. Second, if you want the id to come first, place it first in the tuple you're building for each item. And third, since your source RDD contains java.util.Lists, you'll have to convert them to Scala collections (this can be done implicitly with the right import):

import scala.collection.JavaConversions._ // import to get implicit conversion

val test: RDD[(Long, Foo)] = source
  .filter(x => x.getFooList != null)
  .map(x => (x.getFooList, x.getId))
  .flatMap { // use flatMap
    case (foo, id) => foo.map(f => (id, f)) // switch the order
  }
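As a side note, scala.collection.JavaConversions has been deprecated since Scala 2.12; on newer Scala versions the same approach can be written with explicit conversions via JavaConverters (a minimal sketch, assuming the same source and Foo as above):

import org.apache.spark.rdd.RDD
import scala.collection.JavaConverters._ // explicit .asScala instead of implicits

val test: RDD[(Long, Foo)] = source
  .filter(x => x.getFooList != null)
  .map(x => (x.getFooList, x.getId))
  .flatMap {
    // asScala converts the java.util.List explicitly, keeping the Foo type
    case (foos, id) => foos.asScala.map(f => (id, f))
  }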

Upvotes: 2
