Reputation: 198188
In Spark, there are two ways to apply a series of functions to an RDD.
One is to compose them in a single map to keep it as short as possible:
rdd.map(x => h(f(g(x))))
The other is to chain the maps to make it more readable, like:
rdd.map(g).map(f).map(h)...
Personally I prefer the latter, which is clearer. But some people worry about performance; they consider it the same as:
list.map(g).map(f).map(h)
and think intermediate temporary RDDs will be created along the chain, so they always use the former.
Is that true? Is there any performance issue with the chained version? I personally treat it like a Stream,
and I don't think the two differ much in performance.
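To see why the List comparison is misleading, here is a plain Scala sketch (no Spark involved) contrasting a strict List, where each .map materializes a whole intermediate collection before the next one starts, with an Iterator, where the three maps are fused into one pass per element. The logging functions are illustrative, not from any library:

```scala
import scala.collection.mutable.ListBuffer

val log = ListBuffer[String]()
def g(x: Int) = { log += s"g$x"; x }
def f(x: Int) = { log += s"f$x"; x }
def h(x: Int) = { log += s"h$x"; x }

// Strict List: each .map runs to completion before the next starts,
// building an intermediate List at every step.
List(1, 2).map(g).map(f).map(h)
val strictOrder = log.toList   // g1, g2, f1, f2, h1, h2
log.clear()

// Iterator: the three maps are fused; each element flows through
// g, f, h before the next element is touched.
Iterator(1, 2).map(g).map(f).map(h).toList
val lazyOrder = log.toList     // g1, f1, h1, g2, f2, h2

println(strictOrder.mkString(","))
println(lazyOrder.mkString(","))
```

The lazy, fused ordering is what RDD chains exhibit, as the answer below demonstrates.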
Upvotes: 4
Views: 98
Reputation: 67065
These are pretty much the same thing, since the code will be pipelined.
With the first it is obvious what will happen, as you seem clear on; the chaining, however, results in the following (simplified):
MapPartitionsRDD(
  MapPartitionsRDD(
    MapPartitionsRDD(
      rdd,
      iter.map(g)),
    iter.map(f)),
  iter.map(h))
Simplifying further for visualization:
map(map(map(rdd,g),f),h)
Which when executed boils down to:
h(f(g(rddItem)))
Seem familiar? It is just a chain of pipelined computations, brought to you by the joys of lazy evaluation.
You can see this through an example:
def f(x: Int) = {println(s"f$x");x}
def g(x: Int) = {println(s"g$x");x}
def h(x: Int) = {println(s"h$x");x}
val rdd = sc.makeRDD(1 to 3, 1)
rdd.map(x => h(f(g(x)))).count  // an action is needed to trigger the lazy computation
g1
f1
h1
g2
f2
h2
g3
f3
h3
rdd.map(g).map(f).map(h).count
g1
f1
h1
g2
f2
h2
g3
f3
h3
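The pipelining mechanism can be sketched in a few lines of plain Scala. Spark's real MapPartitionsRDD wraps its parent's partition iterator with .map, so no intermediate collection is ever materialized; the names below (MiniRDD, Source, Mapped, compute) are illustrative stand-ins, not Spark's actual API:

```scala
// Toy model of RDD pipelining: each "map" just wraps the parent's
// iterator, so elements flow through all functions in one pass.
trait MiniRDD[T] { def compute: Iterator[T] }

case class Source[T](data: Seq[T]) extends MiniRDD[T] {
  def compute: Iterator[T] = data.iterator
}

case class Mapped[T, U](parent: MiniRDD[T], fn: T => U) extends MiniRDD[U] {
  // Analogous to MapPartitionsRDD: wrap the parent's iterator with .map.
  def compute: Iterator[U] = parent.compute.map(fn)
}

val base = Source(1 to 3)
val chained = Mapped(Mapped(Mapped(base, (x: Int) => x + 1),
                            (x: Int) => x * 2),
                     (x: Int) => x - 1)
println(chained.compute.toList) // List(3, 5, 7)
```

Nothing runs until the iterator is forced (here via .toList), mirroring how an RDD computes nothing until an action is called.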
Upvotes: 3