What is the difference between calling someRDD.collect.foreach(println) and someRDD.foreach(println)?

Question

I have created an RDD from a CSV file. When I call rdd.collect.foreach(println), the file is printed as it is, but rdd.foreach(println) prints merged output. There are two partitions on the RDD.

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "WordCount")

    val cities = sc.textFile("C:/Users/PSKUMARBEHL/Desktop/us_cities.csv")
    cities.collect.foreach(println)
    cities.foreach(println)
    println(cities.partitions.length)

Assaf Mendelson · Accepted Answer

The two are fundamentally different.

cities.collect.foreach(println)

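first calls collect, which brings every record in cities back to the driver as a local Array, and then prints each element of that array in a plain sequential loop. The array is ordered by partition, which is why the output matches the file. There is no parallelism in the printing, and the whole dataset has to fit in the driver's memory.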
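A minimal sketch of the collect path, assuming a local master with two threads and a small in-memory dataset standing in for the CSV. The object name CollectOnDriver, the helper collectInOrder, and the sample lines are illustrative and not part of the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectOnDriver {
  // Hypothetical sample data standing in for the lines of us_cities.csv.
  val sample: Seq[String] = Seq("city-1", "city-2", "city-3", "city-4")

  // Builds a two-partition RDD and pulls it back to the driver.
  def collectInOrder(sc: SparkContext): Array[String] = {
    val rdd = sc.parallelize(sample, numSlices = 2)
    // collect() returns a local Array, with partitions concatenated in index order.
    rdd.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CollectOnDriver")
    val sc = new SparkContext(conf)
    try {
      // This foreach is Array.foreach: an ordinary sequential loop on the driver.
      collectInOrder(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

For a large RDD, printing a bounded sample with take(n) avoids pulling everything into driver memory.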

cities.foreach(println)

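on the other hand is a parallel operation. It runs println on each record of the cities RDD inside the tasks that process the partitions, that is, on the workers rather than on the driver. With two partitions and a local master, two tasks write to the same console at the same time, so their lines interleave; that is the merged output you see. Had you been using a real cluster (as opposed to a local master), you would not see the println output on the driver at all, because it would go to the standard output of the workers.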
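To make the interleaving visible, each line can be tagged with the partition that produced it. This is a sketch under the same assumptions as before (local master, two partitions, made-up sample lines); the object name ForeachOnExecutors and the helper tagWithPartition are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object ForeachOnExecutors {
  // Hypothetical sample data standing in for the lines of us_cities.csv.
  val sample: Seq[String] = Seq("city-1", "city-2", "city-3", "city-4")

  // Tags every record with its partition index, then collects to the driver
  // so the result is deterministic and ordered by partition.
  def tagWithPartition(sc: SparkContext): Array[String] = {
    val rdd = sc.parallelize(sample, numSlices = 2)
    rdd.mapPartitionsWithIndex { (idx, records) =>
      records.map(line => s"[partition $idx] $line")
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ForeachOnExecutors")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(sample, numSlices = 2)
      // Runs inside the tasks; the order of lines across partitions is not guaranteed.
      rdd.foreach(line => println(s"[partition ${TaskContext.getPartitionId()}] $line"))
      // Deterministic alternative: tag in parallel, print on the driver.
      tagWithPartition(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Within a single partition the records are still printed in order; only the relative order between partitions varies from run to run.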
