Random Certainty

Reputation: 475

Flatten value in paired RDD in spark

I have a paired RDD that looks like

(a1, (a2, a3))
(b1, (b2, b3))
...

I want to flatten the values to obtain

(a1, a2, a3)
(b1, b2, b3)
...

Currently I'm doing

rddData.map(x => (x._1, x._2._1, x._2._2))

Is there a better way of performing the conversion? The above solution gets ugly if the value contains many elements instead of just two.
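
For example, with a hypothetical four-element value, the positional accessors pile up:

rddData.map(x => (x._1, x._2._1, x._2._2, x._2._3, x._2._4))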

Upvotes: 1

Views: 593

Answers (2)

Jeremy

Reputation: 1924

When I'm trying to avoid all the ugly underscore-number stuff that comes with tuple manipulation, I like to use case notation:

rddData.map { case (a, (b, c)) => (a, b, c) }

You can also give your variables meaningful names to make your code self-documenting, and the curly braces mean fewer nested parentheses.
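
For instance, a sketch assuming a hypothetical RDD userRdd whose value nests one level deeper, with shape (userId, (name, (age, city))):

// Hypothetical shape: (userId, (name, (age, city)))
userRdd.map { case (userId, (name, (age, city))) =>
  (userId, name, age, city)
}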

EDIT: The map { case ... } pattern is pretty compact and can be used for surprisingly deeply nested tuples as long as the structure is known at compile time. If you absolutely, positively cannot know the structure of the tuple at compile time, then here is some hacky, slow code that can probably flatten any arbitrarily nested tuple... as long as there are no more than 22 elements in total (Scala tuples stop at Tuple22). It works by recursively converting each element of the tuple to a list, flatMap-ing it into a single list, then using scary reflection to convert the list back into a tuple, as seen here.

// Recursively flatten a nested tuple into a flat list of its elements
def flatten(b: Product): List[Any] = {
  b.productIterator.toList.flatMap {
    case p: Product => flatten(p)
    case x          => List(x)
  }
}

// Rebuild a TupleN from the list via reflection
// (Scala only defines Tuple1 through Tuple22)
def toTuple(as: List[Any]): Product = {
  val tupleClass = Class.forName("scala.Tuple" + as.size)
  tupleClass.getConstructors.apply(0)
    .newInstance(as.map(_.asInstanceOf[AnyRef]): _*)
    .asInstanceOf[Product]
}

rddData.map(t => toTuple(flatten(t)))
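
A quick sanity check with a hypothetical nested tuple:

toTuple(flatten(("a1", ("a2", ("a3", "a4")))))  // returns ("a1", "a2", "a3", "a4")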

Upvotes: 3

Ged

Reputation: 18098

There is no better way. The first answer is equivalent to:

val abc2 = xyz.map { case (k, v) => (k, v._1, v._2) }

which is equivalent to your own example.

Upvotes: 1
