getaway22
getaway22

Reputation: 199

Match two RDDs [String]

I try to match two RDD's: RDD1 contains a huge amount of words [String] and RDD2 contains city names [String].

I want to return a RDD with the elements from RDD1 that are in RDD2. Something like the opposite to subtract.

Afterwards I want to count the occurrence of each remaining word, but that won't be a problem.

thank you

Upvotes: 0

Views: 978

Answers (1)

Vitalii Kotliarenko
Vitalii Kotliarenko

Reputation: 2967

I want to return an RDD with the elements from RDD1 that are in RDD2

If I got you right:

rdd1.subtract(rdd2.subtract(rdd1))

Note the difference between this code and intersection:

val rdd1 = sc.parallelize(Seq("a", "a", "b", "c"))
val rdd2 = sc.parallelize(Seq("a", "c", "d"))
val diff = rdd1.subtract(rdd2)
rdd1.subtract(diff).collect()
res0: Array[String] = Array(a, a, c)
rdd1.intersection(rdd2).collect()
res1: Array[String] = Array(a, c)

So, if your first RDD contains duplicates, and your goal is to take into account that duplicates, your may prefer double subtract solution. Otherwise, intersection fits well.

Upvotes: 2

Related Questions