I have a parquet file named test.parquet. It contains some integers. When I read it using the following code:
val df = spark.read.parquet("test.parquet")
df.show(false)
+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+
The logs show that two jobs were executed: one is a parquet job and the other is a show job. Whereas, when I read the parquet file using the following code:
val df = spark.read.schema(StructType(List(StructField("id",LongType,false)))).parquet("test.parquet")
df.show(false)
+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+
only one job is executed, i.e., the show job.
So, my question is: why does Spark run the extra parquet job in the first case, and why does supplying the schema explicitly avoid it?
Without an explicit schema, Spark reads the file twice: once to infer the schema, and once to create the DataFrame. The first read is the extra parquet job you see in the logs. When you supply the schema explicitly, Spark skips the inference job entirely, so only the DataFrame creation (triggered by show) runs, which is fast.
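If you do not want to hard-code the schema, a common pattern is to pay the inference job once, capture the resulting schema, and reuse it for subsequent reads. A minimal sketch (assuming a running SparkSession named spark and the test.parquet file from the question):

```scala
import org.apache.spark.sql.types.StructType

// First read triggers the schema-inference job once.
val inferredSchema: StructType = spark.read.parquet("test.parquet").schema

// Later reads supply the schema explicitly, so no inference job runs.
val df = spark.read.schema(inferredSchema).parquet("test.parquet")
df.show(false)
```

This trades one up-front inference job for schema-free reads afterwards, which matters most when the same file (or many files with the same layout) is read repeatedly.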