Reputation: 351
Environment: Scala, Spark, Structured Streaming
If I have an array of Datasets in Scala (Array[Dataset[Row]])
that I want to process in parallel with a function that takes a Dataset[Row],
is it enough to pass the array through a map or foreach to make use of Spark's parallelism on a Spark cluster?
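For illustration, here is a minimal sketch of the pattern I have in mind (the data and the processDataset function are just placeholders):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("array-of-datasets").getOrCreate()
import spark.implicits._

// a few small Dataset[Row]s standing in for the real inputs
val datasets: Array[Dataset[Row]] =
  Array(Seq(1, 2), Seq(3), Seq(4, 5)).map(_.toDF("value"))

// placeholder per-Dataset processing function
def processDataset(ds: Dataset[Row]): Unit =
  ds.groupBy("value").count().show()

// the question: does this by itself run the Datasets in parallel on the cluster?
datasets.foreach(processDataset)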
EDIT: I have stumbled on a few issues; I will re-explain the question in another thread.
Upvotes: 1
Views: 1030
Reputation: 10382
I have put together a simple example of processing the Datasets in parallel.
You will need to write the processing function according to your own requirements; the code below is just a sample.
To see the parallel effect, launch your Spark shell with: spark-shell --master yarn --num-executors 3
scala> val dsList = Seq(Seq(1,2).toDS,Seq(4).toDS,Seq(7,9).toDS,Seq(10).toDS)
scala> dsList.par.reduce(_ union _).show(false)
+-----+
|value|
+-----+
|1 |
|2 |
|4 |
|7 |
|9 |
|10 |
+-----+
scala> dsList.par.foreach(_.show(false))
+-----+
|value|
+-----+
|1 |
|2 |
+-----+
+-----+
|value|
+-----+
|4 |
+-----+
+-----+
|value|
+-----+
|7 |
|9 |
+-----+
+-----+
|value|
+-----+
|10 |
+-----+
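show is only for demonstration; the same pattern works with a real action. A rough sketch, where process is a placeholder for whatever your function does with each Dataset:

scala> // placeholder processing function; replace the body with your own logic
scala> def process(ds: org.apache.spark.sql.Dataset[Int]): Long = ds.filter(_ % 2 == 0).count

scala> // .par submits each Dataset's job from its own driver thread,
scala> // so the jobs can run concurrently on the cluster
scala> dsList.par.map(process)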
Upvotes: 1
Reputation: 1386
The simple answer to your question
is it enough to pass the array through a map or foreach to make use of Spark's parallelism on a Spark cluster?
is no. A map or foreach over a plain Scala Array runs on a single node (the driver), so by itself it does not distribute the work across the cluster.
Why not turn the Array[Dataset[Row]] into a single DataFrame?
If you want to run a function on the data row by row, you can use a UDF. Works like a charm!
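For example, a rough sketch of that idea (it assumes your array is called datasetArray and the data has an integer column named value; the UDF body is a placeholder):

import org.apache.spark.sql.functions.{col, udf}

// combine the Array(Dataset[Row]) into a single DataFrame
val combined = datasetArray.reduce(_ union _)

// placeholder row-level logic wrapped as a UDF
val processRow = udf((v: Int) => v * 2)

// Spark distributes the per-row work across the cluster
combined.withColumn("processed", processRow(col("value"))).show(false)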
Upvotes: 0