Pratibha Baghare

Reputation: 21

Why does reading a single CSV file in Spark create multiple jobs and stages?

Whenever I read a CSV file, Spark always creates three jobs, each with one stage, regardless of whether the file is small, large, or contains only a header row. My application performs no transformations and no actions; it only loads the CSV.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WordCount {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Java Spark Application")
                .master("local")
                .getOrCreate();

        // Only a load -- no transformations and no explicit actions.
        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.csv") // legacy alias for the built-in "csv" source
                .option("inferSchema", "true")
                .option("header", "true")
                .load("/home/ist/OtherCsv/EmptyCSV.csv");

        spark.close();
    }
}

Spark UI images:

  1. three jobs in the Spark UI
  2. stage-related info
  3. all three stages have the same DAG visualization
  4. all three jobs have the same DAG visualization
  5. the event timeline

Questions:

  1. Why does loading or reading a CSV always split into exactly three stages and three jobs?
  2. Why are three jobs created when there is no action?
  3. How are stages formed at the code level?

Upvotes: 0

Views: 1580

Answers (1)

varikollu naresh

Reputation: 1

By default, reading CSV, JSON, or Parquet files will create 2 jobs, but if we enable inferSchema for a CSV file it will create 3 jobs. These jobs are launched eagerly by load() itself: with header enabled Spark reads the first line to get the column names, and with inferSchema it must scan the file's contents to determine the column types, so jobs appear even though your code calls no action.
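As a minimal sketch of how to avoid the inference pass, you can supply an explicit schema so Spark never needs to scan the file to work out column types. The class name and the schema fields below are hypothetical, for illustration only; adjust them to match the actual CSV columns:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ExplicitSchemaRead {

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("Explicit Schema Read")
                    .master("local")
                    .getOrCreate();

            // Hypothetical schema; replace with the real column names and types.
            StructType schema = new StructType()
                    .add("word", DataTypes.StringType)
                    .add("count", DataTypes.IntegerType);

            // With an explicit schema, Spark skips the schema-inference scan,
            // so fewer jobs are launched during load().
            Dataset<Row> df = spark.read()
                    .schema(schema)
                    .option("header", "true")
                    .csv("/home/ist/OtherCsv/EmptyCSV.csv");

            spark.close();
        }
    }

You can verify the difference in the Spark UI: with an explicit schema the inference job disappears, leaving only the work needed to read the header and the data.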

Upvotes: 0
