Reputation: 65
I have saved the data in the warehouse in Parquet file format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path as below:
Tespath/filename/dt=2020-02-01
Tespath/filename/dt=2020-02-02
...
Tespath/filename/dt=2020-02-28
If I read all the data, it is a huge amount of data.
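This layout is what Spark's partitionBy writer produces; a minimal sketch of how such data is typically written, assuming the partition column is named dt:

// Writing with partitionBy("dt") creates one dt=<value> subdirectory per date
df.write.partitionBy("dt").parquet("Tespath/filename")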
Upvotes: 1
Views: 1395
Reputation: 476
As your dataset is correctly partitioned using the Parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value taken from the path name: dt=<value>. This value can be used to filter your dataset, and Spark will optimize the read by partition-pruning every directory that does not match your predicate on the dt column.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._

// Partition discovery adds the dt column; the filter below lets Spark
// prune every dt=<value> directory outside the last N days
val df = spark.read.parquet("Testpath/filename/")
  .where($"dt" > date_sub(current_date(), N))
You need to ensure spark.sql.parquet.filterPushdown is set to true (which is the default).
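If you want to check or set that flag explicitly, a minimal sketch using the runtime config API:

// Read the current value (defaults to true) and set it explicitly if needed
println(spark.conf.get("spark.sql.parquet.filterPushdown"))
spark.conf.set("spark.sql.parquet.filterPushdown", "true")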
Upvotes: 3