How to read the data of the last 3 days from a folder with parquet files?

Question

I have a folder with many parquet files that have names as follows:

user_2018-03-15_checked_products.parquet
user_2018-03-15_unchecked_products.parquet
user_2018-03-14_checked_products.parquet
user_2018-03-14_unchecked_products.parquet
user_2018-03-13_checked_products.parquet
user_2018-03-13_unchecked_products.parquet
user_2018-03-12_checked_products.parquet
user_2018-03-12_unchecked_products.parquet

I read all files as follows:

val df = spark.read.parquet("path/to/folder")

The folder contains 100 GB of data and it keeps growing. But I need to read only the data for the last 3 days. Currently, I read the whole folder and then apply a filter. Is it possible to use some kind of mask to select only the file names that belong to the last 3 days, instead of reading the whole folder?

koiralo · Accepted Answer

You can list all the file names and keep only the files whose date falls within the last 3 days:

import java.text.SimpleDateFormat
import java.util.Date
import org.joda.time.{Days, LocalDateTime}

val listOfFiles: Seq[String] = ??? // read all the file names

val filteredFile = listOfFiles.filter { file =>
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
  val fileDate = dateFormat.parse(file.split("_")(1)) // get the date from the file name
  val currentDate = dateFormat.parse(dateFormat.format(new Date())) // current date, truncated to midnight
  // difference in whole days between the file date and today
  val days = Days.daysBetween(new LocalDateTime(fileDate), new LocalDateTime(currentDate)).getDays

  days >= 0 && days <= 3 // keep files dated today or up to 3 days back
}

Now read the filtered files:

spark.read.parquet(filteredFile: _*)

If the list contains only file names, prepend the folder path to each one before reading.
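Since the date is part of the file name, another option is to skip the listing step and build a glob path, which is closer to the "mask" asked about in the question. The following is only a sketch: `RecentPaths` and `lastDaysGlob` are hypothetical names, the `user_<date>_*.parquet` pattern is taken from the file names in the question, and "last 3 days" is assumed to mean today plus the two previous dates. It relies on Spark resolving Hadoop-style globs (`*` and `{a,b}` alternation) in input paths.

```scala
import java.time.LocalDate

// Hypothetical helper, not part of the original answer.
object RecentPaths {
  // Builds a single glob that matches the files for the last `days` dates,
  // counting `today` as the first one.
  def lastDaysGlob(folder: String, today: LocalDate, days: Int): String = {
    // LocalDate.toString yields ISO format (yyyy-MM-dd), as used in the file names
    val dates = (0 until days).map(i => today.minusDays(i.toLong).toString)
    val alternatives = dates.mkString(",")
    s"$folder/user_{$alternatives}_*.parquet"
  }
}

// Usage with an existing SparkSession named `spark` (assumed):
// val df = spark.read.parquet(RecentPaths.lastDaysGlob("path/to/folder", LocalDate.now(), 3))
```

For 2018-03-15 and 3 days this produces path/to/folder/user_{2018-03-15,2018-03-14,2018-03-13}_*.parquet, so only the matching files are read. A date with no files simply matches nothing, but if no file matches at all the read is expected to fail with a "path does not exist" error.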

Hope this helps!
