Reputation: 3782
I have a folder with many parquet files that have names as follows:
user_2018-03-15_checked_products.parquet
user_2018-03-15_unchecked_products.parquet
user_2018-03-14_checked_products.parquet
user_2018-03-14_unchecked_products.parquet
user_2018-03-13_checked_products.parquet
user_2018-03-13_unchecked_products.parquet
user_2018-03-12_checked_products.parquet
user_2018-03-12_unchecked_products.parquet
I read all files as follows:
val df = spark.read.parquet("path/to/folder")
The folder contains 100 GB of data and keeps growing, but I only need the data for the last 3 days. Currently, I read the whole folder and then apply a filter.
Is it possible to use some kind of mask to select only the files whose names fall within the last 3 days, instead of reading the whole folder?
Upvotes: 0
Views: 1469
Reputation: 23109
You can list all the file names and keep only
the files whose date falls within the last 3 days:
import java.text.SimpleDateFormat
import java.util.Date
import org.joda.time.{Days, LocalDate}

val listOfFiles: Seq[String] = ??? // read all the file names
val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
val today = new LocalDate(new Date()) // current date, without time of day
val filteredFile = listOfFiles.filter { file =>
  // the date is the second "_"-separated token of the file name
  val fileDate = new LocalDate(dateFormat.parse(file.split("_")(1)))
  val days = Days.daysBetween(fileDate, today).getDays // difference in days
  days >= 0 && days <= 3
}
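The `???` above is left open on purpose, since it depends on where the folder lives. As a minimal sketch, assuming the folder sits on a local filesystem, the names could be collected like this (`listParquetNames` is a hypothetical helper, not part of Spark; for HDFS or S3 you would list through the Hadoop FileSystem API instead):

```scala
import java.io.File

// Hypothetical helper: names of the .parquet files directly inside `folder`.
// Assumes a local filesystem; returns an empty list if `folder` is not a directory.
def listParquetNames(folder: String): Seq[String] = {
  // File.listFiles returns null for a missing or non-directory path
  val entries = Option(new File(folder).listFiles()).getOrElse(Array.empty[File])
  entries.toList
    .filter(_.isFile)
    .map(_.getName)
    .filter(_.endsWith(".parquet"))
    .sorted
}
```

This returns bare file names, so the `split("_")(1)` date extraction works on them directly.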
Now read only the filtered files:
spark.read.parquet(filteredFile: _*)
If the list holds bare file names, prepend the folder path to each one before reading.
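As for the "mask" asked about in the question, an alternative is to build a glob pattern for the wanted dates and skip the listing step entirely. The sketch below only builds the pattern string; `lastDaysGlob` is a hypothetical helper, and it assumes every file follows the `user_<yyyy-MM-dd>_..._products.parquet` naming shown in the question. It also assumes Spark resolves the path through Hadoop-style globbing, where `{a,b}` means "a or b", which is worth verifying on your Spark version.

```scala
import java.time.LocalDate

// Hypothetical helper: one glob matching the files dated `today` and the
// (days - 1) calendar days before it.
def lastDaysGlob(folder: String, today: LocalDate, days: Int): String = {
  // LocalDate.toString yields ISO yyyy-MM-dd, the format used in the file names
  val alternatives = (0 until days)
    .map(i => today.minusDays(i.toLong).toString)
    .mkString(",")
  s"$folder/user_{$alternatives}_*.parquet"
}

val pattern = lastDaysGlob("path/to/folder", LocalDate.of(2018, 3, 15), 3)
// pattern == "path/to/folder/user_{2018-03-15,2018-03-14,2018-03-13}_*.parquet"
```

With a SparkSession named spark, as in the question, the pattern would then be passed as spark.read.parquet(pattern), using LocalDate.now() in place of the fixed date.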
Hope this helps!
Upvotes: 1