himanshuIIITian

Reputation: 6085

Why does spark.read.parquet() run 2 jobs?

I have a parquet file, named test.parquet. It contains some integers. When I read it using the following code:

val df = spark.read.parquet("test.parquet")

df.show(false)

+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+

In logs it shows 2 jobs that were executed. They are as follows:

[Screenshot of the Spark UI listing the two executed jobs]

One is a parquet job and the other is a show job. However, when I read the parquet file using the following code:

import org.apache.spark.sql.types.{StructType, StructField, LongType}

val df = spark.read.schema(StructType(List(StructField("id", LongType, false)))).parquet("test.parquet")

df.show(false)

+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+

Only one job is executed, i.e., show:

[Screenshot of the Spark UI listing only the show job]

So, my question is:

  1. Why does the first approach execute 2 jobs whereas the second approach executes only one?
  2. And why is the second approach faster than the first one?

Upvotes: 2

Views: 1531

Answers (1)

Ishan Kumar

Reputation: 1982

With the first approach, Spark reads the file twice:

  1. To infer the schema (a separate job that reads the parquet file's metadata)
  2. To create the DataFrame

When you supply the schema up front, the inference step is skipped and the DataFrame is created directly, which is why the second approach is faster.
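You can see the difference for yourself in the Spark UI. A rough, self-contained sketch (the local session, file name, and sample data here are illustrative, not taken from the post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, LongType}

// Build a local session and write a small parquet file to read back.
val spark = SparkSession.builder().master("local[1]").appName("schema-demo").getOrCreate()
import spark.implicits._
(11L to 19L).toDF("id").write.mode("overwrite").parquet("test.parquet")

// 1) No schema given: Spark runs an extra job up front that reads the
//    parquet metadata to infer the schema.
val inferred = spark.read.parquet("test.parquet")

// 2) Schema given explicitly: the inference job is skipped, so only the
//    job triggered by the action (show/collect) appears in the UI.
val provided = spark.read
  .schema(StructType(List(StructField("id", LongType, false))))
  .parquet("test.parquet")

// Both read paths return the same data; only the job count differs.
val rows = provided.orderBy("id").collect().map(_.getLong(0))
```

Comparing the Jobs tab after each read shows the extra schema-inference job for the first variant.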

Upvotes: 6
