user16344431

Why do I see two jobs in Spark UI for a single read?

I am trying to run the script below to load a CSV file with 24k records. Why do I see two jobs for a single load in the Spark UI?

Code:


from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("DM")\
    .getOrCreate()


trades_df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("s3://bucket/source.csv") 

trades_df.rdd.getNumPartitions() returns 1

Spark UI Image

Upvotes: 2

Views: 604

Answers (1)

Mohana B C

Reputation: 5487

That's because Spark reads the CSV file twice when inferSchema is enabled: the extra job scans the data to infer the column types before the actual read.

Read the comments on the function def csv(csvDataset: Dataset[String]): DataFrame in the DataFrameReader source on Spark's GitHub repo.
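
If you want to avoid the extra job, you can supply an explicit schema instead of inferring it. A minimal sketch, assuming hypothetical column names and types (trade_id, symbol, price, trade_ts are placeholders, not the actual columns of source.csv):

from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType, TimestampType)

# Hypothetical schema -- replace with the real columns of source.csv
trade_schema = StructType([
    StructField("trade_id", StringType(), True),
    StructField("symbol", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("trade_ts", TimestampType(), True),
])

# With an explicit schema, Spark skips the schema-inference pass
trades_df = spark.read.format("csv")\
    .option("header", "true")\
    .schema(trade_schema)\
    .load("s3://bucket/source.csv")

With the schema provided up front, the inference pass over the file is skipped, so the extra job should disappear from the Spark UI.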

Upvotes: 1
