Reputation: 551
I have an Azure storage account with JSON files, partitioned by year/month/day/hour. I need to list all JSON files between two dates, e.g. 20200505 to 20201220, so that I end up with a list of URLs/directories. I do not need to load any content, just list all files that live between these two dates.
I need to use Azure Databricks with PySpark for this. Is it possible to just use something like:
.load(from "<Path>/y=2020/month=05/day=05/**/*.json" to "<Path>/y=2020/month=12/day=20/**/*.json")
Here is the structure of the Azure storage account:
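(Illustrative layout; only the year/month/day/hour partitioning is taken from the description above, the folder and file names themselves are made up:)
<Path>/year=2020/month=05/day=05/hour=00/part-0001.json
<Path>/year=2020/month=05/day=05/hour=01/part-0002.json
...
<Path>/year=2020/month=12/day=20/hour=23/part-0815.json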
Upvotes: 1
Views: 209
Reputation: 42352
Spark does not provide a generic way of selecting an interval of date partitions, but you can try to specify the ranges manually as below:
# PySpark's load() takes a single path or a list of paths,
# so pass the three range patterns as a list:
spark.read.format("json").load([
    "<Path>/year=2020/month=05/day={0[5-9],[1-3][0-9]}/**/*.json",
    "<Path>/year=2020/month={0[6-9],1[0-1]}/day=[0-3][0-9]/**/*.json",
    "<Path>/year=2020/month=12/day={[0-1][0-9],20}/**/*.json",
])
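Note that the load() above still reads the JSON files (at least to infer a schema). If you really only need the list of paths, one possible alternative is to list the matching files with Hadoop's FileSystem.globStatus, which you can reach through Spark's JVM gateway on a Databricks cluster. This is only a sketch, assuming the same folder layout as above and that the storage credentials are already configured on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The same date-range globs as above; "<Path>" stands for the real storage root.
patterns = [
    "<Path>/year=2020/month=05/day={0[5-9],[1-3][0-9]}/**/*.json",
    "<Path>/year=2020/month={0[6-9],1[0-1]}/day=[0-3][0-9]/**/*.json",
    "<Path>/year=2020/month=12/day={[0-1][0-9],20}/**/*.json",
]

# FileSystem.globStatus only lists matching files, it does not read their contents.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

files = []
for pattern in patterns:
    path = jvm.org.apache.hadoop.fs.Path(pattern)
    fs = path.getFileSystem(hadoop_conf)
    # globStatus returns null (None in py4j) when nothing matches
    for status in fs.globStatus(path) or []:
        files.append(status.getPath().toString())

print(files)

This leans on the internal _jvm/_jsc accessors, which is a common workaround on Databricks; dbutils.fs.ls is another option if you prefer to walk the folders yourself.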
Upvotes: 2