Kaushal

Reputation: 3367

How to get new RDD from PairRDD based on Key

In my Spark application, I am using a JavaPairRDD<Integer, List<Tuple3<String, String, String>>> which holds a large amount of data.

My requirement is to create several other RDDs of type JavaRDD<Tuple3<String, String, String>> from that large PairRDD, selected by key.

Upvotes: 2

Views: 1160

Answers (1)

Daniel Darabos

Reputation: 27470

I don't know the Java API, but here's how you would do it in Scala (in spark-shell):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def rddByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, Seq[V])]) = {
  // Collect the distinct keys to the driver, then build one filtered RDD per key.
  rdd.keys.distinct.collect.map {
    key => key -> rdd.filter(_._1 == key).values.flatMap(identity)
  }
}

You have to filter for each key and flatten the Lists with flatMap.
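Since the question is about the Java API, here is a rough Java 8 sketch of the same filter-per-key idea. The class, method, and variable names are placeholders, and the flatMap signature is version-dependent: on Spark 1.x the lambda returns an Iterable (as below), on Spark 2.x it must return an Iterator (e.g. List::iterator).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple3;

public class RddByKey {
    // Splits the pair RDD into one JavaRDD per key, mirroring the Scala version above.
    public static Map<Integer, JavaRDD<Tuple3<String, String, String>>> rddByKey(
            JavaPairRDD<Integer, List<Tuple3<String, String, String>>> pairRdd) {
        Map<Integer, JavaRDD<Tuple3<String, String, String>>> result = new HashMap<>();
        // Collect the distinct keys to the driver, then filter once per key.
        for (Integer key : pairRdd.keys().distinct().collect()) {
            JavaRDD<Tuple3<String, String, String>> perKey = pairRdd
                    .filter(t -> t._1().equals(key)) // keep only the pairs with this key
                    .values()                        // JavaRDD<List<Tuple3<...>>>
                    .flatMap(list -> list);          // flatten the lists (Spark 1.x signature)
            result.put(key, perKey);
        }
        return result;
    }
}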

I have to mention that this is not a useful operation. If you were able to build the original RDD, that means each List is small enough to fit into memory. So I don't see why you would want to make them into RDDs.
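If all you need is the data for one particular key on the driver, a plain lookup is usually enough (a minimal sketch; pairRdd and the key 42 are placeholders):

// Returns every List stored under the key 42, already materialized on the driver.
List<List<Tuple3<String, String, String>>> listsFor42 = pairRdd.lookup(42);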

Upvotes: 3
