Reputation: 3775
Is there any way to collect all RDD[(String, String)]
into one RDD[Map[String, String]]
?
E.g., for file input.csv
:
1,one
2,two
3,three
Code:
val file = sc.textFile("input.csv")
val pairs = file.map(line => { val a = line.split(","); (a(0), a(1)) })
val rddMap = ???
Output (approximate):
val map = rddMap.collect
map: Array[scala.collection.immutable.Map[String,String]] = Array(Map(1 -> one, 2 -> two, 3 -> three))
Tried pairs.collectAsMap
but it returns Map
not inside RDD
.
Upvotes: 0
Views: 3574
Reputation: 11985
How many maps would you get in a RDD[Map[String, String]]
? Just one, right? The RDD
distributes its content, because it's a distributed collection, but if it contains only one element, it becomes quite harder to distribute that collection, doesn't it ?
I would suggest you need hash-based lookup in your PairRDD
of String
s. Thankfully, you already have this in the API, with the lookup
function.
Look at the code for lookup
, it does use hashing to get to your key, in a similar way a Map
would. Building the keys and values correctly in your PairRDD
are enough for your purposes, even if building them is complex.
Upvotes: 0
Reputation: 40380
I don't actually agree with what you are trying to do. Because if you do so, you're map will be distributed on the cluster but it won't be one map!
You can use a key-value pair RDD and use lookup
method to find your value upon a given key !
def lookup(key: K): Seq[V] // Return the list of values in the RDD for key key.
And here is an example about it's usage:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
res0: Seq[String] = WrappedArray(tiger, eagle)
For more information about pair RDDs
, I suggest that you read the Chapter 4. Working with Key/Value Pairs - Learning Spark.
Upvotes: 1
Reputation: 1819
If you want preserve your map only for the time of execution of your diriver program you can collect it to the local map (in the driver) for the next task it will be available in the closer (you can just use it in function passed to the next task). If you dont want trasport it many times you can broadcast it.
On the other hand if you want to use it in different driver programs you can just serialize it and save on hdfs (or any other storage system you use). In this case even if you would have RDD you could not preserve it between drivers without saving it to the file system.
Upvotes: 0