SunTech

Reputation: 59

What is the difference between calling someRDD.collect.foreach(println) and someRDD.foreach(println)?

I have created an RDD from a CSV file. When I call rdd.collect.foreach(println), it prints the file as it is, but rdd.foreach(println) prints merged output. There are two partitions on the RDD.

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "WordCount")
    val cities = sc.textFile("C:/Users/PSKUMARBEHL/Desktop/us_cities.csv")
    cities.collect.foreach(println)    // prints the lines in file order
    cities.foreach(println)            // prints the lines "merged"
    println(cities.partitions.length)  // number of partitions

Upvotes: 2

Views: 2608

Answers (1)

Assaf Mendelson

Reputation: 13001

The two are fundamentally different.

cities.collect.foreach(println)

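first calls collect, which brings all records in cities back to the driver as a local array, and then prints each element of that array on the driver. This means the printing has no parallelism, and the whole dataset has to fit in the driver's memory. Because collect returns the partitions in order, the output matches the file.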
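To make the driver-side behaviour concrete, here is a minimal sketch. The object name, the helper sampleCities, and the four city names are made up for illustration (they stand in for the CSV lines); only parallelize, collect and take are Spark calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectSketch {
  // Hypothetical stand-in for the CSV lines, split across 2 partitions.
  def sampleCities(sc: SparkContext): RDD[String] =
    sc.parallelize(Seq("Austin", "Boston", "Chicago", "Denver"), numSlices = 2)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CollectSketch")
    val sc = new SparkContext(conf)
    try {
      val cities = sampleCities(sc)

      // collect() returns a plain Array[String] on the driver, with the
      // partitions concatenated in order.
      val local: Array[String] = cities.collect()

      // This foreach is the ordinary Scala Array method, not an RDD action.
      local.foreach(println)

      // take(n) ships only the first n records to the driver, which is
      // usually the safer choice when the RDD may be large.
      cities.take(2).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If you only want to eyeball the data, something like take is probably preferable to collect, since it bounds how much is pulled into the driver.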

cities.foreach(println)

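on the other hand is a parallel operation. It runs the function println on each record in the cities RDD, and it does so on the workers, one task per partition. With two partitions, two tasks can print at the same time, so their lines interleave; that is the "merged" output you see. Had you been using a real cluster (as opposed to a local master), you would not see the output on the driver console at all, because the println calls run on the workers and write to their stdout.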
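One way to see the interleaving is to tag each line with the index of the partition it came from before printing. This is only a sketch: the object name, the helper tagWithPartition, and the sample data are hypothetical; mapPartitionsWithIndex, foreach and collect are the Spark calls involved.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ForeachSketch {
  // Prefix every record with the index of the partition that holds it.
  def tagWithPartition(lines: RDD[String]): RDD[String] =
    lines.mapPartitionsWithIndex { (idx, it) =>
      it.map(line => s"[partition $idx] $line")
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ForeachSketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the CSV lines, split across 2 partitions.
      val cities = sc.parallelize(Seq("Austin", "Boston", "Chicago", "Denver"), numSlices = 2)
      val tagged = tagWithPartition(cities)

      // RDD.foreach runs inside the tasks, one task per partition. Order is
      // kept within a partition, but lines from different partitions may
      // interleave, and on a cluster they land in the executors' stdout.
      tagged.foreach(println)

      // Printing on the driver after collect() gives a deterministic order.
      tagged.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Running this a few times with a local master should show the first batch of lines changing order between runs while the second batch stays the same.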

Upvotes: 6
