Reputation: 767
I have a RDD that is containing an array of key,value pairs. I want to get an element with key (say 4).
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:29
I have tried to apply filter on it but getting error.
scala> val c = b.filter(p => p(0) = 4);
<console>:31: error: value update is not a member of (Int, String)
val c = b.filter(p => p(0) = 4);
I want to print the key,value pair with specific key (say 4) as Array((4,lion))
The data is always coming in the form of an array of key,value pair
Upvotes: 1
Views: 7008
Reputation: 37435
There's a lookup
method applicable to RDDs of Key-Value pairs (RDDs of type RDD[(K,V)]
) that directly offers this functionality.
b.lookup(4)
// res4: Seq[String] = WrappedArray(lion)
b.lookup(5)
// res6: Seq[String] = WrappedArray(tiger, eagle)
Upvotes: 1
Reputation: 35404
use p._1
instead of p(0)
.
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd: RDD[(Int, String)] = rdd.keyBy(_.length)
val filterRdd: RDD[(Int, String)] = kvRdd.filter(p => p._1 == 4)
//display rdd
println(filterRdd.collect().toList)
List((4,lion))
Upvotes: 1