user2200660
user2200660

Reputation: 1281

Filter one RDD based on keys in another

I have two RDDs:

**rdd1**
id1 val1
id2 val2

**rdd2**
id1 v1
id2 v2
id1 v3
id8 v7
id1 v4
id3 v5
id6 v6

I want to filter RDD2 such that it contains keys of rdd1 only. So the output will be

**output**
id1 v1
id2 v2
id1 v3
id1 v4

This has been asked in stackoverflow before but for a smaller dataset where people broadcasted set and then used to filter, but my problem is rdd1 size is > 500 million, and rdd2 is more than 10 billion

Pls help

Upvotes: 4

Views: 2178

Answers (1)

Jean Logeart
Jean Logeart

Reputation: 53859

Use join:

val res: RDD[(Long, V)] = rdd1.join(rdd2)
                              .map { case(k, (_, v2)) => (k, v2) }

Upvotes: 6

Related Questions