Reputation: 5389
I have a Spark RDD of this datatype: RDD[(Int, Array[Int])]
Sample values of that RDD are:
100, Array(1, 2, 3, 4, 5)
200, Array(1, 2, 50, 20)
300, Array(30, 2, 400, 1)
I would like to get all the unique values among all the Array elements of this RDD. I don't care about the keys; I just want all the unique values. So the result from the sample above is (1, 2, 3, 4, 5, 20, 30, 50, 400).
What would be an efficient way to do that?
Upvotes: 1
Views: 5133
Reputation: 1918
I think this should work:
val result = rdd.flatMap(_._2).distinct
if you want the result as an RDD, or
val result = rdd.flatMap(_._2).distinct.collect
if you want the result in a local collection.
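If you want to sanity-check the logic without spinning up a Spark cluster, the same flatMap/distinct pattern works on plain Scala collections, since the RDD API mirrors them. A minimal sketch using the sample data from the question (no SparkContext needed; the sample values are assumptions from the question):

```scala
// Plain-Scala equivalent of rdd.flatMap(_._2).distinct,
// using the sample pairs from the question.
val data = Seq(
  (100, Array(1, 2, 3, 4, 5)),
  (200, Array(1, 2, 50, 20)),
  (300, Array(30, 2, 400, 1))
)

// Flatten out the arrays (dropping the keys), then de-duplicate.
val unique = data.flatMap(_._2).distinct.sorted

println(unique.mkString(", "))  // 1, 2, 3, 4, 5, 20, 30, 50, 400
```

On a real RDD, `distinct` triggers a shuffle, so for very large data you may want to tune its `numPartitions` argument; but for most cases `flatMap` followed by `distinct` is the idiomatic way to do this.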
Upvotes: 5