Carltonp

Reputation: 1344

How to filter a Spark dataframe based on datestamp of the file

Can someone let me know how to filter on the datestamp in a file name?

I have the following files in their respective folders in Azure Data Lake:

adl://carlslake.azuredatalakestore.net/folderOne/filenr1_1166_2018-12-20%2006-05-52.csv

adl://carlslake.azuredatalakestore.net/folderTwo/filenr2_1168_2018-12-22%2006-07-31.csv

I have written the following script, which reads all .csv files in both folders, but I only want to read the .csv files in each folder whose names contain the current date.

test1 = spark.read.csv("adl://carlslake.azuredatalakestore.net/folderOne/",inferSchema=True,header=True)
test2 = spark.read.csv("adl://carlslake.azuredatalakestore.net/folderTwo/",inferSchema=True,header=True)

Can someone let me know how to tweak the above so it reads only the files matching the current date, e.g. the two .csv files are dated 2018-12-20 and 2018-12-22?

I thought it might have been written something like

test1 = spark.read.csv("adl://carlslake.azuredatalakestore.net/folderOne/", select(current_date)inferSchema=True,header=True)

But that didn't work
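For context, `spark.read.csv` accepts glob patterns in the path, so the usual approach is to build the pattern string from today's date before passing it to the reader. A minimal sketch, where the `dated_glob` helper name and the explicit `"2018-12-20"` argument are illustrative, not from the original post:

```python
from datetime import date

def dated_glob(base_path, day=None):
    # Build a glob matching files whose name contains the given
    # yyyy-MM-dd datestamp (the format used in the file names above).
    day = day or date.today().isoformat()
    return f"{base_path}/*{day}*.csv"

pattern = dated_glob("adl://carlslake.azuredatalakestore.net/folderOne",
                     "2018-12-20")
print(pattern)
# adl://carlslake.azuredatalakestore.net/folderOne/*2018-12-20*.csv
```

The resulting pattern can then be passed to the reader, e.g. `spark.read.csv(pattern, inferSchema=True, header=True)`.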

Upvotes: 0

Views: 1144

Answers (2)

Mariano Billinghurst

Reputation: 87

Just go with

# today defined as in the other answer, e.g. datetime.today().date()
test1 = spark.read.csv(f"adl://carlslake.azuredatalakestore.net/testfolder/RAW/*{today}.csv")

The other pattern, *_{today}*.csv, was not matching your example file filenr1_1166_2018-12-20%2006-05-52.csv.

Upvotes: 1

Mikhail Berlinkov

Reputation: 1624

Try something like

from datetime import datetime

today = datetime.today().date()
test1 = spark.read.csv(f"adl://carlslake.azuredatalakestore.net/folderOne/*_{today}*.csv")

Upvotes: 1
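As a quick local sanity check (no Spark needed), Python's `fnmatch` can test a glob against the example file name from the question; `%20` in the listing is URL-encoding for a space. Hadoop's glob syntax differs from `fnmatch` in some details, but `*` behaves the same way here:

```python
import fnmatch

# Example file name from the question, with %20 decoded to a space.
name = "filenr1_1166_2018-12-20 06-05-52.csv"

matches = fnmatch.fnmatch(name, "*_2018-12-20*.csv")
print(matches)  # True
```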
