Reputation: 351
Environment: Scala, Spark, Structured Streaming
If I have an array of Datasets in Scala (Array[Dataset[Row]])
that I want to process in parallel with a function that takes a Dataset[Row],
is it enough to pass the array through a map or foreach to make use of Spark's parallelism on a Spark cluster?
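For illustration, here is a minimal sketch of the pattern I have in mind (the data and the processDataset function are just placeholders):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("array-of-datasets").getOrCreate()
import spark.implicits._

// a few small Dataset[Row]s standing in for the real inputs
val datasets: Array[Dataset[Row]] =
  Array(Seq(1, 2), Seq(3), Seq(4, 5)).map(_.toDF("value"))

// placeholder per-Dataset processing function
def processDataset(ds: Dataset[Row]): Unit =
  ds.groupBy("value").count().show()

// the question: does this by itself run the Datasets in parallel on the cluster?
datasets.foreach(processDataset)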
EDIT: I have stumbled on a few issues; I will re-explain the question in another thread.
Upvotes: 1
Views: 1030
Reputation: 10382
I have put together a simple example of processing the Datasets in parallel.
You will need to write the processing function according to your own requirements; the code below is just a sample.
To see the parallel effect, launch your Spark shell with: spark-shell --master yarn --num-executors 3
scala> val dsList = Seq(Seq(1,2).toDS,Seq(4).toDS,Seq(7,9).toDS,Seq(10).toDS)
scala> dsList.par.reduce(_ union _).show(false)
+-----+
|value|
+-----+
|1 |
|2 |
|4 |
|7 |
|9 |
|10 |
+-----+
scala> dsList.par.foreach(_.show(false))
+-----+
|value|
+-----+
|1 |
|2 |
+-----+
+-----+
|value|
+-----+
|4 |
+-----+
+-----+
|value|
+-----+
|7 |
|9 |
+-----+
+-----+
|value|
+-----+
|10 |
+-----+
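show is only for demonstration; the same pattern works with a real action. A rough sketch, where process is a placeholder for whatever your function does with each Dataset:

scala> // placeholder processing function; replace the body with your own logic
scala> def process(ds: org.apache.spark.sql.Dataset[Int]): Long = ds.filter(_ % 2 == 0).count

scala> // .par submits each Dataset's job from its own driver thread,
scala> // so the jobs can run concurrently on the cluster
scala> dsList.par.map(process)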
Upvotes: 1
Reputation: 1386
The simple answer to your question
is it enough to pass the array through a map or foreach to make use of Spark's parallelism on a Spark cluster?
is no. A map or foreach over a plain Scala Array runs on a single node (the driver), so by itself it does not distribute the work across the cluster.
Why not turn the Array[Dataset[Row]] into a single DataFrame?
If you want to run a function on the data row by row, you can use a UDF. Works like a charm!
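For example, a rough sketch of that idea (it assumes your array is called datasetArray and the data has an integer column named value; the UDF body is a placeholder):

import org.apache.spark.sql.functions.{col, udf}

// combine the Array(Dataset[Row]) into a single DataFrame
val combined = datasetArray.reduce(_ union _)

// placeholder row-level logic wrapped as a UDF
val processRow = udf((v: Int) => v * 2)

// Spark distributes the per-row work across the cluster
combined.withColumn("processed", processRow(col("value"))).show(false)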
Upvotes: 0