Tmpoul
Tmpoul

Reputation: 27

How to get Rdd values that exists in array?

I have Rdd[(Int, Double)] and an array[Int] and i want to get a new Rdd[(Int, Double)] with only those Int that exist in the array too.

E.g if my array is [0, 1, 2] and my rdd is (1, 4.2), (5, 4.3), i want to get as output rdd only the (1, 4.2)

I am thinking about using filter with a function that iterates the array, do the comparison and returns true/false but i am not sure if it is the logic of spark.

Something like:

val newrdd = rdd.filter(x => f(x._1, array)) 

where

f(x:Int, y:Array[In]): Boolean ={
   val z = false 
   for (a<-0 to y.length-1){
         if (x == y(a)){
            z = true
            z}
       z

}

Upvotes: 0

Views: 1003

Answers (3)

Hosam Aly
Hosam Aly

Reputation: 42443

val acceptableValues = array.toSet
rdd.filter { case (x, _) => acceptableValues(x) }

Upvotes: 0

1pluszara
1pluszara

Reputation: 1528

Try this:

rdd.filter(x => Array(0,1,2).contains(x._1)).collect.foreach(println)

Output:

(1,4.2) 

Upvotes: 0

Chandan Ray
Chandan Ray

Reputation: 2091

//Input rdd

val rdd = sc.parallelize(Seq((1,4.2),(5,4.3)))

//array, convert to rdd

val arrRdd = sc.parallelize(Array(0,1,2))

//convert rdd and arrRdd to dataframe

val arrDF = arrRdd.toDF()
val df = rdd.toDF()

//do join and again convert it to rdd

df.join(arrDF,df.col("_1") === arrDF.col("value"),"leftsemi").rdd.collect

//output Array([1,4.2])

Upvotes: 1

Related Questions