Soumitra

Reputation: 612

How to convert Scala RDD to Map

I have an RDD of strings, org.apache.spark.rdd.RDD[String] = MappedRDD[18], and I want to convert it to a map with unique ids. I did 'val vertexMap = vertices.zipWithUniqueId', but this gave me another RDD of type 'org.apache.spark.rdd.RDD[(String, Long)]', whereas I want a 'Map[String, Long]'. How can I convert my 'org.apache.spark.rdd.RDD[(String, Long)]' to 'Map[String, Long]'?

Thanks

Upvotes: 7

Views: 31420

Answers (3)

WestCoastProjects

Reputation: 63221

You do not need to convert it. The implicit conversions for PairRDDFunctions detect a two-tuple-based RDD and make the PairRDDFunctions methods available on it automatically.
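To see what this enrichment pattern looks like, here is a minimal pure-Scala analogue (no Spark required): an implicit class named PairSeqFunctions (a hypothetical name for illustration) adds pair-specific methods to any Seq of two-tuples, much as importing SparkContext._ makes lookup and collectAsMap available on an RDD[(K, V)].

```scala
// Hypothetical pure-Scala sketch of the PairRDDFunctions-style enrichment:
// an implicit class adds pair-specific methods to Seq[(K, V)].
implicit class PairSeqFunctions[K, V](pairs: Seq[(K, V)]) {
  // Build a local Map from the pairs (last value wins on duplicate keys).
  def collectAsMap: Map[K, V] = pairs.toMap
  // Return every value associated with the given key.
  def lookup(key: K): Seq[V] = pairs.collect { case (k, v) if k == key => v }
}

val vertices = Seq(("vertexA", 0L), ("vertexB", 1L))
val asMap = vertices.collectAsMap // Map(vertexA -> 0, vertexB -> 1)
val ids = vertices.lookup("vertexA") // Seq(0)
```

The point is only that no explicit wrapping call is needed: once the implicit is in scope, the pair methods appear on any two-tuple collection.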

Upvotes: 3

maasg

Reputation: 37435

There's a built-in collectAsMap function in PairRDDFunctions that delivers a map of the pair values in the RDD:

val vertexMap = vertices.zipWithUniqueId.collectAsMap

It's important to remember that an RDD is a distributed data structure. You can visualize it as 'pieces' of your data spread over the cluster. When you collect, you force all those pieces to go to the driver, and for that to work, they need to fit in the driver's memory.

From the comments, it looks like in your case you need to deal with a large dataset. Making a Map out of it is not going to work, as it won't fit in the driver's memory, causing OOM exceptions if you try.

You probably need to keep the dataset as an RDD. If you are creating a Map in order to look up elements, you could use lookup on a PairRDD instead, like this:

import org.apache.spark.SparkContext._  // import implicit conversions to support PairRDDFunctions

val vertexMap = vertices.zipWithUniqueId
val vertexYId = vertexMap.lookup("vertexY")

Upvotes: 26

Eugene Zhulenev

Reputation: 9734

Collect to the "local" machine (the driver) and then convert the resulting Array[(String, Long)] to a Map:

val rdd: RDD[String] = ???

val map: Map[String, Long] = rdd.zipWithUniqueId().collect().toMap
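Once collect() has run, the data is an ordinary local Array[(String, Long)], so plain Scala's toMap finishes the job (no Spark involved). A small sketch with sample data standing in for the collected array:

```scala
// After collect(), the pairs live locally as Array[(String, Long)].
// Sample data stands in for the collected RDD contents.
val collected: Array[(String, Long)] = Array(("a", 0L), ("b", 1L), ("c", 2L))

// toMap builds an immutable Map; with zipWithUniqueId every element got
// a distinct id, so no key collisions occur here.
val map: Map[String, Long] = collected.toMap
```

Note that toMap keeps the last value seen for a duplicate key, which is harmless here since zipWithUniqueId assigns each element its own id.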

Upvotes: 8
