Reputation: 766
I have the following lists of key-value pairs in the Spark context (like hashmaps, but not exactly):
val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))
I want to get something like this, and do it efficiently in parallel (I don't even know if that's possible):
List(1 -> (Some("a"), Some("A")), 2 -> (Some("b"), Some("B")), 3 -> (Some("c"), Some("C")), 4 -> (Some("d"), None), 5 -> (None, Some("E")))
Or at least
List(1 -> ("a","A"), 2 -> ("b","B"), 3 -> ("c","C"))
How can I achieve this? As I understand it, I don't have an efficient way to look up values by key, since these aren't really hashmaps.
Upvotes: 2
Views: 256
Reputation: 367
The accepted answer is correct. If you want to read more about the topic, check out this Key/Value pairs article.
Upvotes: 1
Reputation: 53829
You can use the fullOuterJoin function:
import org.apache.spark.rdd.RDD

val m1: RDD[(Int, String)] = // ...
val m2: RDD[(Int, String)] = // ...
val j: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)
Depending on your use case, you can use any of the join variants:
val full: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)
val left: RDD[(Int, (String, Option[String]))] = m1.leftOuterJoin(m2)
val right: RDD[(Int, (Option[String], String))] = m1.rightOuterJoin(m2)
val join: RDD[(Int, (String, String))] = m1.join(m2)
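For reference, here is a minimal runnable sketch of the full outer join applied to the sample data from the question (assuming a local SparkContext named sc is already in scope, as in the question):

import org.apache.spark.rdd.RDD

val m1: RDD[(Int, String)] = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2: RDD[(Int, String)] = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

val full: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)

// collect() brings the result to the driver; sortBy is only for readable output.
full.collect().sortBy(_._1).foreach(println)
// (1,(Some(a),Some(A)))
// (2,(Some(b),Some(B)))
// (3,(Some(c),Some(C)))
// (4,(Some(d),None))
// (5,(None,Some(E)))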
Upvotes: 2
Reputation: 67075
A simple join should work:
rdd1.join(rdd2) // RDD[(K, (V1, V2))]
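Applied to the question's data, a minimal sketch (assuming the same sc as in the question):

val rdd1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val rdd2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

// An inner join keeps only keys present in both RDDs, matching the second desired output.
rdd1.join(rdd2).collect().sortBy(_._1).foreach(println)
// (1,(a,A))
// (2,(b,B))
// (3,(c,C))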
Upvotes: 1