pkozlov

Reputation: 766

Combine two key-value collections with Spark efficiently

I have the following lists of key-value pairs in the Spark context (hashmap-like, but not actual hashmaps):

val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

I want to get something like this, and do it efficiently in parallel (I don't even know if that's possible):

List(1 -> (Some("a"), Some("A")), 2 -> (Some("b"), Some("B")), 3 -> (Some("c"), Some("C")), 4 -> (Some("d"), None), 5 -> (None, Some("E")))

Or at least

List(1 -> ("a","A"), 2 -> ("b","B"), 3 -> ("c","C"))

How can I achieve this? As I understand it, I have no efficient way to look up values in these "maps" by key - they aren't really hashmaps.

Upvotes: 2

Views: 256

Answers (4)

ayan guha

Reputation: 1257

Or, use union...

rdd1.union(rdd2).groupByKey()
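On RDDs this produces an `RDD[(Int, Iterable[String])]`. A plain-Scala sketch (not Spark code) of the resulting shape on the question's data; note that the grouped values no longer record which side each string came from:

```scala
// Plain-Scala sketch of what union + groupByKey produces.
// (With Spark the result type would be RDD[(Int, Iterable[String])].)
val m1 = List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
val m2 = List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E")

// Concatenate both sides, then group all values under their key.
val grouped: Map[Int, List[String]] =
  (m1 ++ m2).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

println(grouped(1)) // List(a, A) -- key in both sides
println(grouped(4)) // List(d)    -- key only on the left
```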

Upvotes: 0

Pär Eriksson

Reputation: 367

The accepted answer is correct; if you want to read more about it, check out this article on Key/Value pairs.

Upvotes: 1

Jean Logeart

Reputation: 53829

You can use the fullOuterJoin function:

val m1: RDD[(Int, String)] = //...
val m2: RDD[(Int, String)] = //...
val j: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)

Depending on your use, you can use any variation of joins:

val full:  RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)
val left:  RDD[(Int, (String, Option[String]))]         = m1.leftOuterJoin(m2)       
val right: RDD[(Int, (Option[String], String))]         = m1.rightOuterJoin(m2)
val join:  RDD[(Int, (String, String))]                 = m1.join(m2)

Upvotes: 2

Justin Pihony

Reputation: 67075

A simple join should work:

rdd1.join(rdd2) // RDD[(K, (V1, V2))]
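Note that `join` is an inner join, so only keys present on both sides survive. A plain-Scala sketch (not Spark code) of this on the question's data:

```scala
// Plain-Scala sketch of join (inner join) semantics.
// (With Spark the result type would be RDD[(Int, (String, String))].)
val m1 = Map(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
val m2 = Map(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E")

// Keep only keys that appear in both maps; keys 4 and 5 are dropped.
val joined: Map[Int, (String, String)] =
  m1.collect { case (k, v1) if m2.contains(k) => k -> (v1, m2(k)) }

println(joined) // Map(1 -> (a,A), 2 -> (b,B), 3 -> (c,C))
```

This corresponds to the second, smaller output shape the question mentions, where the unmatched keys 4 and 5 disappear.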

Upvotes: 1
