Reputation: 199
I try to match two RDD's: RDD1 contains a huge amount of words [String] and RDD2 contains city names [String].
I want to return a RDD with the elements from RDD1 that are in RDD2.
Something like the opposite to subtract
.
Afterwards I want to count the occurrence of each remaining word, but that won't be a problem.
thank you
Upvotes: 0
Views: 978
Reputation: 2967
I want to return an RDD with the elements from RDD1 that are in RDD2
If I got you right:
rdd1.subtract(rdd2.subtract(rdd1))
Note the difference between this code and intersection
:
val rdd1 = sc.parallelize(Seq("a", "a", "b", "c"))
val rdd2 = sc.parallelize(Seq("a", "c", "d"))
val diff = rdd1.subtract(rdd2)
rdd1.subtract(diff).collect()
res0: Array[String] = Array(a, a, c)
rdd1.intersection(rdd2).collect()
res1: Array[String] = Array(a, c)
So, if your first RDD contains duplicates, and your goal is to take into account that duplicates, your may prefer double subtract
solution. Otherwise, intersection
fits well.
Upvotes: 2