Reputation: 79
I've been trying to work this out, and I think "flatten" might be part of my solution, but I just can't get it right.
Imagine:
case class Thing(value1: Int, value2: Int)
case class Container(string1: String, listOfThings: List[Thing], string2: String)
So my list, a List[Container], could be any size, but for now we'll say it has 3 elements. Inside each Container there is a list, listOfThings: List[Thing], that could also hold any number of Things; for now we'll say it has 3 as well.
So what I want to get is something like
val fullListOfThings: List[Thing] = List(Thing(1,1), Thing(1,2), Thing(1,3),
  Thing(2,1), Thing(2,2), Thing(2,3), Thing(3,1), Thing(3,2), Thing(3,3))
where the first value in each Thing is its Container number and the second value is the Thing's position within that Container.
I hope all this makes sense.
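For concreteness, given the case classes above, the sample data I'm describing could be built like this (the string values are just placeholders):
// Three Containers, each holding three Things; Thing(c, t) records the
// Container number c and the Thing's position t within that Container.
val containers: List[Container] = (1 to 3).toList.map(c =>
  Container(s"start$c", (1 to 3).toList.map(t => Thing(c, t)), s"end$c"))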
To make it more complicated for me, my list of Containers is not actually a List but an RDD,
rddOfContainers: RDD[Container]
and what I need at the end is an RDD of Things,
fullRddOfThings: RDD[Thing]
In the Java I'm more used to this would be pretty straightforward, but Scala is different. I'm pretty new to Scala and am having to learn it on the fly, so any full explanation would be very welcome.
I want to avoid bringing in too many external libraries if I can. In the meantime I'll keep reading. Thanks.
Upvotes: 0
Views: 1867
Reputation: 5903
// Pull each Container's inner list out and concatenate them into one RDD[Thing].
val rddOfThings = rddOfContainers.flatMap(container => container.listOfThings)
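flatMap on an RDD already yields a new RDD, so the result does not need to go back through sc.parallelize.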
Upvotes: 0
Reputation: 16308
With an RDD, as with any other proper Scala collection, you can use flatMap for such operations:
val containers = sc.parallelize(Seq(
  Container("", List(Thing(1,2), Thing(2,3)), ""),
  Container("", Nil, ""),
  Container("", List(Thing(3,4)), "")))
//containers: org.apache.spark.rdd.RDD[Container]
val things = containers flatMap (_.listOfThings)
//things: org.apache.spark.rdd.RDD[Thing]
things.collect()
//res2: Array[Thing] = Array(Thing(1,2), Thing(2,3), Thing(3,4))
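For completeness, a minimal sketch reproducing the exact 3×3 example from the question, assuming the same SparkContext sc (the string values are placeholders):
val rddOfContainers = sc.parallelize((1 to 3).map(c =>
  Container(s"c$c", (1 to 3).toList.map(t => Thing(c, t)), s"c$c")))
val fullRddOfThings = rddOfContainers flatMap (_.listOfThings)
fullRddOfThings.collect()
//res: Array[Thing] = Array(Thing(1,1), Thing(1,2), Thing(1,3), Thing(2,1), Thing(2,2), Thing(2,3), Thing(3,1), Thing(3,2), Thing(3,3))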
Upvotes: 2