Reputation: 1905
I have a dataframe that looks like this:
product1 product2 difference
123 456 0.5
123 789 1
456 789 0
456 123 0.5
789 123 1
789 456 0
I would like an output that looks like this:
{'123': {'456': 0.5, '789': 1}, '456': {'123': 0.5, '789': 0}, '789': {'123': 1, '456': 0}}
So far I have tried zipWithIndex and collectAsMap with no luck.
The code I have tried so far is:
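import org.apache.spark.rdd.RDD
// Key each row by product1; the value is the (product2, difference) pair.
val tuples: RDD[(Int, (Int, Double))] = products.rdd
  .map(r => (r(0).toString.toDouble.toInt, (r(1).toString.toDouble.toInt, r(2).toString.toDouble)))
val lst = tuples.groupByKey().map(r => (r._1, r._2.toSeq))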
This gives me a list of products and differences instead of a hash map.
Upvotes: 1
Views: 8620
Reputation: 22449
You can first convert the DataFrame to an RDD, transform it into key-value pairs, and perform a groupByKey. To obtain the result in the desired Map form, you'll need to collect the grouped RDD (so this may not be feasible for a large dataset):
val df = Seq(
  (123, 456, 0.5),
  (123, 789, 1.0),
  (456, 789, 0.0),
  (456, 123, 0.5),
  (789, 123, 1.0),
  (789, 456, 0.0)
).toDF("product1", "product2", "difference")
import org.apache.spark.sql.Row
val groupedRDD = df.rdd.map {
  case Row(p1: Int, p2: Int, diff: Double) => (p1, (p2, diff))
}.groupByKey.mapValues(_.toMap)
groupedRDD.collectAsMap
// res1: scala.collection.Map[Int,scala.collection.immutable.Map[Int,Double]] = Map(
// 456 -> Map(789 -> 0.0, 123 -> 0.5), 789 -> Map(123 -> 1.0, 456 -> 0.0), 123 -> Map(456 -> 0.5, 789 -> 1.0)
// )
Upvotes: 2
Reputation: 44992
If I understand your question correctly, you want something like this:
val myRdd = sc.makeRDD(List(
  (123, (456, 0.5)),
  (123, (789, 1.0)),
  (456, (789, 0.0)),
  (456, (123, 0.5)),
  (789, (123, 1.0)),
  (789, (456, 0.0))
))
val myHashMap = myRdd.groupByKey.mapValues(_.toMap).collect.toMap
// gives:
// scala.collection.immutable.Map[Int,scala.collection.immutable.Map[Int,Double]] =
// Map(
// 456 -> Map(789 -> 0.0, 123 -> 0.5),
// 789 -> Map(123 -> 1.0, 456 -> 0.0),
// 123 -> Map(456 -> 0.5, 789 -> 1.0)
// )
Brief explanation: groupByKey gives you pairs like (123, Seq((456, 0.5), (789, 1.0))), where the second component is an Iterable of all values for that key. You want to convert that second component (the "values") to a map, so you call mapValues(_.toMap). Then, if you really want to load the collection onto your driver node and convert it to a local, non-distributed map, you must call collect. This gives you essentially a list of tuples of type (Int, Map[Int, Double]). Now you can call toMap on this collection to obtain a map of maps.
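If it helps to see the steps above without a cluster, here is a minimal sketch that mirrors them on a plain Scala List instead of an RDD. It is only an illustration: the object name NestedMapSketch and the vals rows, grouped and nested are made up for this example, and groupBy/map on local collections stand in for Spark's groupByKey/mapValues.

```scala
// Plain Scala collections stand in for the RDD here; no Spark is involved.
// All names in this object are hypothetical, chosen for illustration only.
object NestedMapSketch {
  // Same sample rows as above: (product1, (product2, difference)).
  val rows: List[(Int, (Int, Double))] = List(
    (123, (456, 0.5)),
    (123, (789, 1.0)),
    (456, (789, 0.0)),
    (456, (123, 0.5)),
    (789, (123, 1.0)),
    (789, (456, 0.0))
  )

  // Step 1 (analogous to groupByKey): gather the (product2, difference)
  // pairs that belong to each product1.
  val grouped: Map[Int, List[(Int, Double)]] =
    rows.groupBy(_._1).map { case (p1, kvs) => p1 -> kvs.map(_._2) }

  // Step 2 (analogous to mapValues(_.toMap)): turn each group's list of
  // pairs into an inner map keyed by product2.
  val nested: Map[Int, Map[Int, Double]] =
    grouped.map { case (p1, pairs) => p1 -> pairs.toMap }

  def main(args: Array[String]): Unit = {
    // Look up the difference between products 456 and 789.
    println(nested(456)(789))
  }
}
```

On a real RDD the collect step is what moves the data to the driver; in this local sketch everything is already in memory, so that step has no counterpart.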
Upvotes: 0