blue-sky

Reputation: 53806

Are Scala functions that run on a Spark Array parallelized?

To map a function against all elements of an RDD, it is required to first convert the RDD to an Array type using the collect method:

scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))

scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15

Are all operations on the Array type in the above example "x" run in parallel?

Upvotes: 2

Views: 3353

Answers (2)

Majid Hajibaba

Reputation: 3260

In Spark standalone applications (not the REPL) you must change the order of operations:

first call flatMap and then collect.

According to the Spark documentation, flatMap is a transformation, and all transformations in Spark are lazy, in that they do not compute their results right away. The operation is delayed until an action method such as collect is called. After calling collect, Spark runs the operations behind flatMap in parallel.
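As a sketch (assuming a spark-shell session where sc is in scope, as in the question):

    scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
    scala> val flat = x.flatMap(y => y)   // transformation: lazy, nothing runs yet
    scala> flat.collect()                 // action: triggers the distributed computation
    res0: Array[String] = Array(a, b, c, d)

Here flatMap only records the computation; the work is scheduled across partitions when collect is called.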

Upvotes: 0

Alexey Romanov

Reputation: 170735

To map a function against all elements of an RDD, it is required to first convert the RDD to an Array type using the collect method

No, it isn't. RDD has map method.
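For example, continuing the question's spark-shell session (a hypothetical sketch), you can call map directly on the RDD, and it runs distributed across partitions:

    scala> x.map(_.size).collect()
    res1: Array[Int] = Array(1, 1, 2)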

Are all operations on the Array type in the above example "x" run in parallel?

There are no operations on the Array type in the above example. x is still an RDD; you throw away the array created by x.collect(). If you call x.collect().map(...) or x.collect().flatMap(...) instead, those operations are not run in parallel.

Generally speaking, Spark does not affect operations on arrays or Scala collections in any way; only operations on RDDs are ever run in parallel. Of course, you can use e.g. Scala parallel collections to parallelize computations within a single node, but this is unrelated to Spark.
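To make the distinction concrete, here is a minimal standalone sketch in plain Scala (no Spark involved, object name hypothetical): once you hold a local Array, flatMap on it is an ordinary, eager, single-threaded collection operation.

```scala
// Hypothetical sketch: after collect(), you hold a plain Scala Array on
// the driver, and operations on it are ordinary local collection
// methods -- Spark does not parallelize them in any way.
object LocalArrayExample {
  def main(args: Array[String]): Unit = {
    // The same data as in the question, but as a local Array
    // (what x.collect() would return).
    val collected: Array[List[String]] = Array(List("a"), List("b"), List("c", "d"))

    // This flatMap is scala.collection's flatMap: eager and sequential.
    val flat: Array[String] = collected.flatMap(y => y)

    println(flat.mkString(","))
  }
}
```

Running it prints `a,b,c,d`, computed entirely on a single thread of the local JVM.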

Upvotes: 4
