himanshuIIITian

Reputation: 6085

Why does spark.read.parquet() run 2 jobs?

I have a parquet file, named test.parquet. It contains some integers. When I read it using the following code:

val df = spark.read.parquet("test.parquet")

df.show(false)

+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+

In logs it shows 2 jobs that were executed. They are as follows:

[Screenshot of the Spark UI listing the two executed jobs]

One is a parquet job and the other is a show job. However, when I read the parquet file using the following code:

import org.apache.spark.sql.types.{StructType, StructField, LongType}

val df = spark.read.schema(StructType(List(StructField("id", LongType, false)))).parquet("test.parquet")

df.show(false)

+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+

Only one job is executed, i.e., show:

[Screenshot of the Spark UI listing only the show job]

So, my question is:

  1. Why does the first approach execute 2 jobs whereas the second approach executes only one?
  2. And why is the second approach faster than the first one?

Upvotes: 2

Views: 1531

Answers (1)

Ishan Kumar

Reputation: 1982

With the first approach, Spark reads the file twice:

  1. To infer the schema (a separate job that reads the parquet file's metadata)
  2. To create the DataFrame

When you supply the schema up front, the inference step is skipped and the DataFrame is created directly, which is why the second approach is faster.
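You can see the difference for yourself in the Spark UI. A rough, self-contained sketch (the local session, file name, and sample data here are illustrative, not taken from the post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, LongType}

// Build a local session and write a small parquet file to read back.
val spark = SparkSession.builder().master("local[1]").appName("schema-demo").getOrCreate()
import spark.implicits._
(11L to 19L).toDF("id").write.mode("overwrite").parquet("test.parquet")

// 1) No schema given: Spark runs an extra job up front that reads the
//    parquet metadata to infer the schema.
val inferred = spark.read.parquet("test.parquet")

// 2) Schema given explicitly: the inference job is skipped, so only the
//    job triggered by the action (show/collect) appears in the UI.
val provided = spark.read
  .schema(StructType(List(StructField("id", LongType, false))))
  .parquet("test.parquet")

// Both read paths return the same data; only the job count differs.
val rows = provided.orderBy("id").collect().map(_.getLong(0))
```

Comparing the Jobs tab after each read shows the extra schema-inference job for the first variant.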

Upvotes: 6
