Reputation: 65
I have saved the data in the warehouse in Parquet file format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path as below:
Tespath/filename/dt=2020-02-01
Tespath/filename/dt=2020-02-02
...
Tespath/filename/dt=2020-02-28
If I read all the data, it is a huge amount of data.
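This layout is what Spark's partitionBy writer produces; a minimal sketch of how such data is typically written, assuming the partition column is named dt:

// Writing with partitionBy("dt") creates one dt=<value> subdirectory per date
df.write.partitionBy("dt").parquet("Tespath/filename")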
Upvotes: 1
Views: 1395
Reputation: 476
As your dataset is correctly partitioned using the Parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value taken from the path name: dt=<value>. This value can be used to filter your dataset, and Spark will optimize the read by partition-pruning every directory that does not match your predicate on the dt column.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._

// Partition discovery adds the dt column; the filter below lets Spark
// prune every dt=<value> directory outside the last N days
val df = spark.read.parquet("Testpath/filename/")
  .where($"dt" > date_sub(current_date(), N))
You need to ensure spark.sql.parquet.filterPushdown is set to true (which is the default).
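If you want to check or set that flag explicitly, a minimal sketch using the runtime config API:

// Read the current value (defaults to true) and set it explicitly if needed
println(spark.conf.get("spark.sql.parquet.filterPushdown"))
spark.conf.set("spark.sql.parquet.filterPushdown", "true")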
Upvotes: 3