Reputation: 524
A fairly complex algorithm is applied to a list of rows from a Spark Dataset (the list was obtained using groupByKey and flatMapGroups). Most rows are transformed 1:1 from input to output, but some scenarios require more than one output row per input row. The input row schema can change at any time. map() fits the requirements quite well for the 1:1 transformation, but is there a way to use it to produce 1:n output?
The only workaround I found relies on the foreach method, which has an unpleasant overhead caused by creating the initial empty list (remember, unlike the simplified example below, the real-life list structure changes randomly).
My original problem is too complex to share here, but this example demonstrates the concept. Take a list of integers. Each should be transformed into its square, and if the input is even it should also produce one half of the original value:
val X = Seq(1, 2, 3, 4, 5)
val y = X.map(x => x * x) //map is intended for 1:1 transformation so it works great here
val z = X.map(x => for(n <- 1 to 5) (n, x * x)) //this attempt FAILS - the for loop has no yield, so it produces a list of five Unit values
// this work-around works, but newX definition is problematic
var newX = List[Int]() //in reality defining as head of the input list and dropping result's tail at the end
val za = X.foreach(x => {
newX = x*x :: newX
if(x % 2 == 0) newX = (x / 2) :: newX
})
newX
Is there a better way than the foreach construct?
Upvotes: 0
Views: 806
Reputation: 4928
.flatMap produces any number of outputs from a single input.
val X = Seq(1, 2, 3, 4, 5)
X.flatMap { x =>
  if (x % 2 == 0) Seq(x * x, x / 2) else Seq(x * x) // always the square, plus the half for even inputs
}
// => Seq[Int] = List(1, 4, 1, 9, 16, 2, 25)
In X.map(f), f is a function that maps each input to a single output. By contrast, in X.flatMap(g), the function g maps each input to a sequence of outputs. flatMap then takes all the sequences produced (one for each element in X) and concatenates them.
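To make the difference concrete, here is a minimal sketch that applies the same function with both map and flatMap, using the rule from the question (always the square, plus the half for even inputs). The object name FlatMapVsMap and the helper expand are made up for this illustration.

```scala
object FlatMapVsMap {
  val xs = Seq(1, 2, 3, 4, 5)

  // Hypothetical helper: returns one or two outputs per input.
  def expand(x: Int): Seq[Int] =
    if (x % 2 == 0) Seq(x * x, x / 2) else Seq(x * x)

  // map keeps one result per input, so the result is a sequence of sequences.
  val nested: Seq[Seq[Int]] = xs.map(expand)

  // flatMap concatenates the per-input sequences into one flat sequence.
  val flat: Seq[Int] = xs.flatMap(expand)

  def main(args: Array[String]): Unit = {
    println(nested) // List(List(1), List(4, 1), List(9), List(16, 2), List(25))
    println(flat)   // List(1, 4, 1, 9, 16, 2, 25)
  }
}
```

In other words, xs.flatMap(expand) gives the same result as xs.map(expand).flatten, without building the intermediate nested list by hand.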
The neat thing is that .flatMap works not just for sequences, but for other container types as well. For an Option, for instance, Option(x)#flatMap(g) will allow g to return an Option. Similarly, Future(x)#flatMap(g) will allow g to return a Future.
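As a rough sketch of the Option case: the helpers parse and half below are hypothetical, chosen only to show a function that itself returns an Option being chained with flatMap.

```scala
import scala.util.Try

object OptionFlatMap {
  // Hypothetical helper: Some(n) if the string is a valid integer, None otherwise.
  def parse(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // Hypothetical helper: Some(n / 2) only for even n.
  def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None

  def main(args: Array[String]): Unit = {
    println(parse("8").flatMap(half))   // Some(4)
    println(parse("7").flatMap(half))   // None: 7 is odd
    println(parse("abc").flatMap(half)) // None: not a number
    println(parse("8").map(half))       // Some(Some(4)): map leaves the result nested
  }
}
```

Here flatMap plays the same role as for sequences: it lets the function decide whether there is an output at all, and it removes the extra layer of nesting that map would leave behind.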
Whenever the number of elements you return depends on the input, you should think of flatMap.
Upvotes: 3