Shiv

Reputation: 149

Why are sc.parallelize(List) values printed in a different order on each execution?

I've just parallelized a list of values as an RDD and tried printing it in spark-shell. It prints the values in a different order every time. As far as I know, this is because of the nature of an RDD and how it stores data. However, I would like the values to come out in the same order every time. How do I achieve that?

scala> val num1=sc.parallelize(List(1,2,3))
num1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> num1.foreach(println)
3
1
2

scala> num1.foreach(println)
1
2
3

scala> num1.foreach(println)
2
3
1

scala> num1.foreach(println)
2
3
1

scala> num1.foreach(println)
1
3
2

Upvotes: 0

Views: 232

Answers (1)

Reactormonk

Reputation: 21700

There are two different things at play here: collection order and the order of side effects (println here). The collection order is stable, so you should get the same list back each time you call collect on it. With foreach, however, the order isn't guaranteed: the function runs on the executors, one task per partition, and Spark makes no guarantee about the order in which those side effects happen. So if you only care about collection order, don't worry about it. If you care about the order of effects, you may have to collect first and then run everything on the local machine, which largely defeats the purpose of Spark.
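As a minimal sketch of the "collect first" approach, assuming the same spark-shell session as in the question (where `sc` is already defined) and a dataset small enough to fit on the driver, something like this should print in a stable order. The names `ordered` and `sorted` are only illustrative.

```scala
// Run inside spark-shell, where `sc` (the SparkContext) is predefined.
val num1 = sc.parallelize(List(1, 2, 3))

// collect() brings the elements back to the driver in partition order,
// so the resulting local array has the same order on every call.
val ordered: Array[Int] = num1.collect()

// This foreach is the plain Scala Array method and runs on the driver only,
// so the println side effects happen sequentially, in array order.
ordered.foreach(println)

// If the data is not already in the order you want, sort it on the
// cluster first and then collect the sorted result.
val sorted: Array[Int] = num1.sortBy(x => x).collect()
sorted.foreach(println)
```

Note that collect() pulls the whole RDD into driver memory, so this only makes sense for small results or for debugging.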

Upvotes: 3
