Reputation: 1281
I have two RDDs:
**rdd1**
id1 val1
id2 val2
**rdd2**
id1 v1
id2 v2
id1 v3
id8 v7
id1 v4
id3 v5
id6 v6
I want to filter RDD2 such that it contains keys of rdd1 only. So the output will be
**output**
id1 v1
id2 v2
id1 v3
id1 v4
This has been asked in stackoverflow before but for a smaller dataset where people broadcasted set and then used to filter, but my problem is rdd1 size is > 500 million, and rdd2 is more than 10 billion
Pls help
Upvotes: 4
Views: 2178
Reputation: 53859
Use join:
val res: RDD[(Long, V)] = rdd1.join(rdd2)
.map { case(k, (_, v2)) => (k, v2) }
Upvotes: 6