gorrch

Reputation: 551

List JSON files, partitioned by year/month/day, from an Azure storage account using PySpark

I have an Azure storage account with JSON files, partitioned by year/month/day/hour. I need to list all JSONs between two dates, e.g. 20200505 to 20201220, so that I end up with a list of URLs/directories. I do not need to load any content, just to list all files that live between these two dates.

I need to use Azure Databricks with PySpark for this. Is it possible to just use something like:

.load(from "<Path>/y=2020/month=05/day=05/**/*.json" to "<Path>/y=2020/month=12/day=20/**/*.json")

Here is the structure of the Azure storage account: [screenshot of the partitioned year/month/day/hour folder hierarchy]

Upvotes: 1

Views: 209

Answers (1)

mck

Reputation: 42352

Spark does not provide a generic way of selecting an interval of date partitions, but you can try to specify the ranges manually as below:

.load([
    "<Path>/year=2020/month=05/day={0[5-9],[1-3][0-9]}/**/*.json",    # May 05-31
    "<Path>/year=2020/month={0[6-9],1[0-1]}/day=[0-3][0-9]/**/*.json",  # Jun-Nov, all days
    "<Path>/year=2020/month=12/day={[0-1][0-9],20}/**/*.json",        # Dec 01-20
])

Note that in PySpark the paths must be passed as a single list: unlike in Scala, extra positional string arguments to .load would be interpreted as the format and schema parameters.

Upvotes: 2
