Reputation: 53806
To map a function against all elements of an RDD, it is required to first convert the RDD to an Array type using the collect method:
scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))
scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15
Are all operations on the Array type in the above example "x" run in parallel?
Upvotes: 2
Views: 3353
Reputation: 3260
In Spark standalone applications (not the REPL) you must change the order of operations: first call flatMap, then collect.
According to the Spark documentation, flatMap is a transformation, and all transformations in Spark are lazy: they do not compute their results right away. The computation is deferred until an action such as collect is called.
Only when collect is called does Spark run the flatMap operation in parallel.
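A minimal sketch of this laziness in a standalone application, assuming a local master (the app name and setup here are illustrative, not from the original post):
import org.apache.spark.{SparkConf, SparkContext}

object LazyFlatMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-flatMap").setMaster("local[*]"))
    val nested = sc.parallelize(List(List("a"), List("b"), List("c", "d")))

    // flatMap is a transformation: nothing runs here, Spark only records
    // the lineage and returns a new RDD immediately.
    val flat = nested.flatMap(identity)

    // collect is an action: only now does Spark schedule tasks over the
    // partitions, run flatMap in parallel, and gather the results.
    val result: Array[String] = flat.collect()
    println(result.mkString(", ")) // a, b, c, d

    sc.stop()
  }
}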
Upvotes: 0
Reputation: 170735
To map a function against all elements of an RDD, it is required to first convert the RDD to an Array type using the collect method
No, it isn't. RDD has a map method.
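For instance, map and flatMap can be called on the RDD directly; here is a short sketch reusing the x from the question (the toUpperCase transformation is just an illustration):
val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))

// Both calls return new RDDs and run in parallel across the partitions.
val upper = x.map(_.map(_.toUpperCase)) // RDD[List[String]]
val flat  = x.flatMap(identity)         // RDD[String]

upper.collect() // Array(List(A), List(B), List(C, D))
flat.collect()  // Array(a, b, c, d)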
Are all operations on the Array type in the above example "x" run in parallel?
There are no operations on the Array type in the above example. x is still an RDD; you throw away the array created by x.collect(). If you call x.collect().map(...) or x.collect().flatMap(...) instead, the operations are not run in parallel.
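For contrast, a sketch of that driver-side variant: after collect(), everything is plain Scala running sequentially in the driver JVM.
val local: Array[List[String]] = x.collect()            // pulls all the data to the driver
val flatLocal: Array[String] = local.flatMap(identity)  // ordinary Array.flatMap, single-threaded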
Generally speaking, Spark does not affect operations on arrays or Scala collections in any way; only operations on RDDs are ever run in parallel. Of course, you can use e.g. Scala parallel collections to parallelize computations within a single node, but this is unrelated to Spark.
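A sketch of that last point, assuming Scala 2.12 or earlier (on 2.13+ the .par method requires the separate scala-parallel-collections module); this parallelizes across local threads only and involves no Spark at all:
val words = Vector("a", "b", "c", "d")
val upper = words.par.map(_.toUpperCase) // mapped in parallel on a local thread pool, not by Spark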
Upvotes: 4