Baptiste Merliot

Reputation: 861

flatten Options inside Spark RDD

I'm trying to reproduce the behaviour of vanilla Scala collections' flatten on a Spark RDD of Options. For example:

Seq(Some(1), None, Some(2), None, None).flatten
> Seq[Int] = List(1, 2)
// None are removed, and Some are unwrapped

sc.parallelize(Seq(Some(1), None, Some(2))).flatten.collect()
> error: value flatten is not a member of org.apache.spark.rdd.RDD[Option[Int]]
>      sc.parallelize(Seq(Some(1), None, Some(2))).flatten.collect()
// The function does not exist for RDDs

Of course the following works, but it means collecting before filtering, i.e. pulling the larger, unfiltered collection onto a single machine (the driver).

sc.parallelize(Seq(Some(1), None, Some(2))).collect().flatten
> Array[Int] = Array(1, 2)

The solution I found is:

sc.parallelize(Seq(Some(1), None, Some(2))).filter(_.isDefined).map(_.get).collect()

but that's not very clear. Is there a cleaner way?

Upvotes: 0

Views: 270

Answers (2)

Leo C

Reputation: 22449

You can simply perform flatMap with an identity function:

sc.parallelize(Seq(Some(1), None, Some(2))).flatMap(x => x).collect
// res1: Array[Int] = Array(1, 2)
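
This works because Scala's implicit conversion Option.option2Iterable turns each Option into an Iterable of zero or one elements, so flatMap drops the Nones and unwraps the Somes.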

Upvotes: 3

YisraelU

Reputation: 372

You can define an extension method on the container class to facilitate flattening, in keeping with the coding style you would like to use.

Generally one does this in Scala using implicit classes, as sketched below.
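
A minimal sketch, assuming a spark-shell session with sc in scope (the RDDExtensions and OptionRDDOps names are my own, purely illustrative, not part of Spark's API):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object RDDExtensions {
  // Adds a `flatten` method to any RDD[Option[A]].
  // The ClassTag context bound is required by RDD.flatMap for the result type.
  implicit class OptionRDDOps[A: ClassTag](rdd: RDD[Option[A]]) {
    def flatten: RDD[A] = rdd.flatMap(x => x)
  }
}

import RDDExtensions._
sc.parallelize(Seq(Some(1), None, Some(2))).flatten.collect()
> Array[Int] = Array(1, 2)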

You can also use flatMap and pass in a function that does the flattening, as in the other answer.

Upvotes: 0
