Reputation:
Could someone please help with some PySpark code to loop over folders and subfolders and pick up the latest file?
The folders and subfolders are laid out as below. I want to drill down to the latest year folder, then the latest month folder, then the latest date folder, and get the file from there; a rough sketch of what I mean follows the example layout below.
Raw/2019
Raw/2020/06/21
Raw/2021/03/18/file.csv
Raw/2021/04/13/file.csv
Raw/2021/04/14/file.csv
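To make the intent a bit more concrete, the selection I am after is roughly this (only an illustration using the example folder names above, not working code against the actual storage):
# illustration only: the year/month/day names are zero-padded, so the
# lexicographically largest name at each level is also the latest one
years  = ["2019", "2020", "2021"]   # folders under Raw/
months = ["03", "04"]               # folders under Raw/2021/
days   = ["13", "14"]               # folders under Raw/2021/04/
print("Raw/" + max(years) + "/" + max(months) + "/" + max(days) + "/file.csv")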
Upvotes: 1
Views: 1366
Reputation: 5032
The best way to get the latest directory would be the File System API that pltc mentioned.
To add to that, here is a small utility function that walks the input path recursively using the File System API and sorts the resulting directories by os.path.getctime:
import os

def latest_dir(inp):
    # Walk the DBFS path recursively; record every directory that directly contains a file
    def recur_directory(path, res):
        for dr in dbutils.fs.ls(path):
            # dr.path is "dbfs:/...", so strip the "dbfs:" scheme and check the local /dbfs mount
            if os.path.isdir("/dbfs" + dr.path[5:]):
                recur_directory(dr.path, res)
            else:
                # first file found: remember its parent directory and stop scanning this level
                res.append(path)
                break
        return res

    dir_lst = recur_directory(inp, [])
    # newest directory first, ordered by creation time on the /dbfs mount
    return sorted(["/dbfs" + x[5:] for x in dir_lst], key=os.path.getctime, reverse=True)
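For example, against the layout in the question, a call might look like this (just a sketch; the paths are the example ones from the question and assume the /dbfs FUSE mount is available):
latest = latest_dir("dbfs:/Raw")
print(latest[0])                 # e.g. /dbfs/Raw/2021/04/14/
print(os.listdir(latest[0]))     # e.g. ['file.csv']
Note that os.path.getctime orders the folders by when they were created on the mount, so this matches the year/month/day hierarchy only if the data lands in chronological order.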
Upvotes: 0
Reputation: 6082
You wouldn't need Spark to do that. Since you're using Azure Databricks, you should use the Databricks File System API (dbutils.fs) instead, so something like this:
lst = dbutils.fs.ls("dbfs:/Raw/")      # one FileInfo entry per year folder under Raw/
print(sorted(lst, reverse=True)[0])    # year names are zero-padded, so the last one sorted is the latest
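If you then need the file itself rather than just the newest year folder, one way (a sketch, assuming the fixed Raw/<year>/<month>/<day>/ layout from the question and that the zero-padded names sort correctly as strings) is to repeat the same idea one level at a time:
base = "dbfs:/Raw/"
# pick the lexicographically largest name at each level of the year/month/day hierarchy
year        = max(dbutils.fs.ls(base),       key=lambda f: f.name)
month       = max(dbutils.fs.ls(year.path),  key=lambda f: f.name)
day         = max(dbutils.fs.ls(month.path), key=lambda f: f.name)
latest_file = max(dbutils.fs.ls(day.path),   key=lambda f: f.name)
print(latest_file.path)    # e.g. dbfs:/Raw/2021/04/14/file.csv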
Upvotes: 1