Michal

Reputation: 1905

Convert Spark DataFrame to HashMap

I have a dataframe that looks like this:

product1 product2 difference
123      456      0.5
123      789      1
456      789      0
456      123      0.5
789      123      1
789      456      0

I would like an output that looks like this:

{'123': {'456': 0.5, '789': 1}, '456': {'123': 0.5, '789': 0}, '789': {'123': 1, '456': 0}}

So far I have tried zipWithIndex and collectAsMap, with no luck.

The code I have tried so far is:

val tpls: RDD[(Int, (Int, Double))] = products.rdd
  .map(r => (r(0).toString.toDouble.toInt, (r(1).toString.toDouble.toInt, r(2).toString.toDouble)))
val lst = tpls.groupByKey().map(r => (r._1, r._2.toSeq))

This gives me a list of products and differences instead of a hash map.

Upvotes: 1

Views: 8620

Answers (2)

Leo C

Reputation: 22449

You can first convert the dataframe to an RDD, transform it into key-value pairs, and perform a groupByKey. To obtain the result as a Map, you'll need to collect the grouped RDD (so this may not be feasible for large datasets):

import spark.implicits._  // needed for toDF

val df = Seq(
  (123, 456, 0.5),
  (123, 789, 1.0),
  (456, 789, 0.0),
  (456, 123, 0.5),
  (789, 123, 1.0),
  (789, 456, 0.0)
).toDF("product1", "product2", "difference")

import org.apache.spark.sql.Row

val groupedRDD = df.rdd.map{
    case Row(p1: Int, p2: Int, diff: Double) => (p1, (p2, diff))
  }.
  groupByKey.mapValues(_.toMap)

groupedRDD.collectAsMap
// res1: scala.collection.immutable.Map[Any,scala.collection.immutable.Map[Int,Double]] = Map(
//   456 -> Map(789 -> 0.0, 123 -> 0.5), 789 -> Map(123 -> 1.0, 456 -> 0.0), 123 -> Map(456 -> 0.5, 789 -> 1.0)
// )

Upvotes: 2

Andrey Tyukin

Reputation: 44992

If I understand your question correctly, you want something like this:

val myRdd = sc.makeRDD(List(
  (123, (456, 0.5)), 
  (123, (789, 1.0)), 
  (456, (789, 0.0)), 
  (456, (123, 0.5)), 
  (789, (123, 1.0)), 
  (789, (456, 0.0))
))


val myHashMap = myRdd.groupByKey.mapValues(_.toMap).collect.toMap

// gives:
// scala.collection.immutable.Map[Int,scala.collection.immutable.Map[Int,Double]] = 
//   Map(
//     456 -> Map(789 -> 0.0, 123 -> 0.5), 
//     789 -> Map(123 -> 1.0, 456 -> 0.0), 
//     123 -> Map(456 -> 0.5, 789 -> 1.0)
//   )

Brief explanation: groupByKey gives you tuples like (123, Seq((456, 0.5), (789, 1.0))). To convert the second component (the "values") of each tuple to a map, call mapValues(_.toMap). Then, if you really do want to load the collection onto the driver as a local, non-distributed map, you must call collect. This gives you essentially a list of tuples of type (Int, Map[Int, Double]). Finally, call toMap on that collection to obtain a map of maps.
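The same pipeline can be checked locally with plain Scala collections (a sketch, not part of the original answer, and no Spark cluster required): groupBy stands in for Spark's groupByKey, and the grouped value lists are converted to inner maps the same way.

```scala
// Local (non-Spark) sketch of the groupByKey -> mapValues(_.toMap) -> toMap pipeline.
val pairs = List(
  (123, (456, 0.5)), (123, (789, 1.0)),
  (456, (789, 0.0)), (456, (123, 0.5)),
  (789, (123, 1.0)), (789, (456, 0.0))
)

// groupBy(_._1) plays the role of groupByKey; the inner map is built
// by dropping the key from each grouped tuple and calling toMap.
val nested: Map[Int, Map[Int, Double]] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).toMap }

println(nested(123)(456)) // 0.5
```

This is handy for unit-testing the transformation logic before running it on a real RDD.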

Upvotes: 0
