Reputation: 27
I have an RDD[(Int, Double)] and an Array[Int], and I want to get a new RDD[(Int, Double)] containing only those pairs whose Int also exists in the array.
E.g. if my array is [0, 1, 2] and my RDD is (1, 4.2), (5, 4.3), I want the output RDD to contain only (1, 4.2).
I am thinking about using filter with a function that iterates over the array, does the comparison and returns true/false, but I am not sure whether that fits the logic of Spark.
Something like:
val newrdd = rdd.filter(x => f(x._1, array))
where
def f(x: Int, y: Array[Int]): Boolean = {
  // var, not val: z is reassigned when a match is found
  var z = false
  for (a <- 0 to y.length - 1) {
    if (x == y(a)) {
      z = true
    }
  }
  z
}
Upvotes: 0
Views: 1003
Reputation: 42443
val acceptableValues = array.toSet
rdd.filter { case (x, _) => acceptableValues(x) }
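For anyone who wants to run this outside the shell, here is a minimal self-contained sketch of the same idea. The object name FilterByKeys, the helper filterByKeys, and the local[*] master are my own assumptions for illustration, not part of the question. The point is that the array is converted to a Set once on the driver, so each record costs a single hash lookup instead of a scan over the array.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch runnable.
object FilterByKeys {

  // Keeps only the pairs whose key appears in `allowed`.
  def filterByKeys(rdd: RDD[(Int, Double)], allowed: Array[Int]): RDD[(Int, Double)] = {
    // Built once on the driver; the immutable Set is serialized with the closure.
    val acceptableValues = allowed.toSet
    rdd.filter { case (key, _) => acceptableValues(key) }
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-by-keys")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq((1, 4.2), (5, 4.3)))
    val array = Array(0, 1, 2)

    // Same data as in the question.
    filterByKeys(rdd, array).collect().foreach(println) // prints (1,4.2)

    spark.stop()
  }
}
```

A plain filter like this fits Spark well: it is a narrow transformation, so no shuffle is involved.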
Upvotes: 0
Reputation: 1528
Try this:
rdd.filter(x => Array(0,1,2).contains(x._1)).collect.foreach(println)
Output:
(1,4.2)
Upvotes: 0
Reputation: 2091
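// Needed for toDF() outside spark-shell (assumes a SparkSession named `spark`)
import spark.implicits._
// Input RDD
val rdd = sc.parallelize(Seq((1,4.2),(5,4.3)))
// Array, converted to an RDD
val arrRdd = sc.parallelize(Array(0,1,2))
// Convert rdd and arrRdd to DataFrames
val arrDF = arrRdd.toDF()
val df = rdd.toDF()
// Left semi join keeps only the rows of df whose key is in arrDF; then convert back to an RDD
df.join(arrDF,df.col("_1") === arrDF.col("value"),"leftsemi").rdd.collect
// Output: Array([1,4.2])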
Upvotes: 1