red1ynx
red1ynx

Reputation: 3775

Spark RDD[(String, String)] into RDD[Map[String, String]]

Is there any way to collect all RDD[(String, String)] into one RDD[Map[String, String]]?

E.g., for file input.csv:

1,one
2,two
3,three

Code:

val file = sc.textFile("input.csv")
val pairs = file.map(line => { val a = line.split(","); (a(0), a(1)) })
val rddMap = ???

Output (approximate):

val map = rddMap.collect
map: Array[scala.collection.immutable.Map[String,String]] = Array(Map(1 -> one, 2 -> two, 3 -> three))

Tried pairs.collectAsMap but it returns Map not inside RDD.

Upvotes: 0

Views: 3574

Answers (3)

Francois G
Francois G

Reputation: 11985

How many maps would you get in a RDD[Map[String, String]] ? Just one, right? The RDD distributes its content, because it's a distributed collection, but if it contains only one element, it becomes quite harder to distribute that collection, doesn't it ?

I would suggest you need hash-based lookup in your PairRDD of Strings. Thankfully, you already have this in the API, with the lookup function.

Look at the code for lookup, it does use hashing to get to your key, in a similar way a Map would. Building the keys and values correctly in your PairRDD are enough for your purposes, even if building them is complex.

Upvotes: 0

eliasah
eliasah

Reputation: 40380

I don't actually agree with what you are trying to do. Because if you do so, you're map will be distributed on the cluster but it won't be one map!

You can use a key-value pair RDD and use lookup method to find your value upon a given key !

def lookup(key: K): Seq[V]  // Return the list of values in the RDD for key key.

And here is an example about it's usage:

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))  
b.lookup(5) 
res0: Seq[String] = WrappedArray(tiger, eagle)

For more information about pair RDDs, I suggest that you read the Chapter 4. Working with Key/Value Pairs - Learning Spark.

Upvotes: 1

abalcerek
abalcerek

Reputation: 1819

If you want preserve your map only for the time of execution of your diriver program you can collect it to the local map (in the driver) for the next task it will be available in the closer (you can just use it in function passed to the next task). If you dont want trasport it many times you can broadcast it.

On the other hand if you want to use it in different driver programs you can just serialize it and save on hdfs (or any other storage system you use). In this case even if you would have RDD you could not preserve it between drivers without saving it to the file system.

Upvotes: 0

Related Questions