Reputation: 3367
In my Spark application, I am using a JavaPairRDD<Integer, List<Tuple3<String, String, String>>> which holds a large amount of data. My requirement is to derive other RDDs of type JavaRDD<Tuple3<String, String, String>> from that large pair RDD, one per key.
Upvotes: 2
Views: 1160
Reputation: 27470
I don't know the Java API, but here's how you would do it in Scala (in spark-shell):
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def rddByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, Seq[V])]) = {
  // Collect the distinct keys on the driver, then derive one filtered RDD per key.
  rdd.keys.distinct.collect.map {
    key => key -> rdd.filter(_._1 == key).values.flatMap(identity)
  }
}
You have to filter for each key and flatten the Lists with flatMap.
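For illustration, here is a minimal usage sketch in spark-shell, assuming a small made-up pair RDD (the data below is invented for the example):

val pairs: RDD[(Int, Seq[String])] = sc.parallelize(Seq(
  1 -> Seq("a", "b"),
  2 -> Seq("c"),
  1 -> Seq("d")
))

// rddByKey returns Array[(Int, RDD[String])]: one RDD per distinct key.
rddByKey(pairs).foreach { case (k, values) =>
  println(s"$k -> ${values.collect.toList}")   // e.g. 1 -> List(a, b, d)
}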
I have to mention that this is not a useful operation. If you were able to build the original RDD, that means each List is small enough to fit into memory. So I don't see why you would want to make them into RDDs.
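If the goal is just to get one key's values rather than a per-key RDD, a simpler route along these lines (my reading of the point above, not part of the original answer) is lookup from PairRDDFunctions, which returns a key's values to the driver as a local Seq. Using the pairs RDD from the sketch above:

// lookup fetches all values for the key to the driver; each element is one of the Seqs.
val listsForKey1: Seq[Seq[String]] = pairs.lookup(1)
val flattened: Seq[String] = listsForKey1.flatten   // List(a, b, d)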
Upvotes: 3